CN110534113B - Audio data desensitization method, device, equipment and storage medium - Google Patents


Info

Publication number
CN110534113B
CN110534113B (application CN201910790391.4A)
Authority
CN
China
Prior art keywords
text
audio
audio data
segment
sensitive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910790391.4A
Other languages
Chinese (zh)
Other versions
CN110534113A (en)
Inventor
石真
付嘉懿
Current Assignee
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN201910790391.4A
Publication of CN110534113A
Application granted
Publication of CN110534113B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60: Protecting data
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142: Hidden Markov Models [HMMs]
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to a method, an apparatus, a device, and a storage medium for desensitizing audio data. A terminal performs speech recognition on audio data to obtain corresponding text data together with the correspondence between each text segment in the text data and an audio segment. It then performs semantic recognition on the text data with a preset sensitive information recognition model to obtain a set of sensitive text segments, and desensitizes the audio data according to that set and the text-to-audio correspondence, obtaining desensitized audio data. Because every step of the voice desensitization process is performed automatically, manual desensitization of the audio data is avoided and desensitization efficiency is improved.

Description

Audio data desensitization method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technology, and in particular, to a method, an apparatus, a device, and a storage medium for desensitizing audio data.
Background
With the continuous development of society, exchanging information through audio data has become a common mode of communication. For example, a user sends a piece of audio data to other users through social software so that they learn the information the user wants to express. However, under the relevant laws and regulations, and in view of user privacy, some information is unsuitable for dissemination; such information is defined as sensitive words, and the process of removing sensitive words from audio data is called voice desensitization.
In conventional voice desensitization, the audio data is played back so that staff can judge from what they hear whether a sensitive word is present; when they determine that one is, they locate the time period corresponding to the sensitive word in the audio data and delete the audio within that period.
However, when the data amount of the audio data is large, the conventional voice desensitization method is inefficient.
Disclosure of Invention
In view of the above, there is a need to provide an audio data desensitization method, apparatus, device and storage medium to address the inefficiency of conventional voice desensitization methods.
In a first aspect, a method of desensitizing audio data is provided, the method comprising:
performing voice recognition on the audio data to obtain text data corresponding to the audio data and a correspondence between each text segment in the text data and an audio segment; each audio segment is a section of audio in the audio data;
performing semantic recognition on the text data by using a preset sensitive information recognition model, and acquiring a sensitive text fragment set through the semantic recognition, wherein the sensitive text fragment set consists of sensitive text fragments in the text data;
and desensitizing the audio data according to the sensitive text fragment set and the corresponding relation between each text fragment and the audio fragment to obtain desensitized audio data.
In one embodiment, the desensitizing the audio data according to the sensitive text segment set and the corresponding relationship between each text segment and the audio segment includes:
receiving a sensitive text fragment selection instruction input by a user;
acquiring the selected sensitive text fragment from the sensitive text fragment set according to the instruction of the sensitive text fragment selection instruction;
and desensitizing the audio data according to the selected sensitive text segments and the corresponding relation between each text segment and the audio segment.
In one embodiment, the desensitizing the audio data according to the sensitive text segment set and the corresponding relationship between each text segment and an audio segment includes:
and desensitizing the audio data according to each sensitive text segment in the sensitive text segment set and the corresponding relation between each text segment and the audio segment.
In one embodiment, the desensitizing the audio data includes deleting an audio segment corresponding to the sensitive text segment or overwriting an audio segment corresponding to the sensitive text segment.
In one embodiment, the preset sensitive information recognition model is a natural language processing (NLP) neural network model.
In one embodiment, the performing voice recognition on the audio data to obtain text data corresponding to the audio data and a corresponding relationship between each text segment in the text data and the audio segment includes:
and inputting the audio data into a preset voice recognition model to obtain text data corresponding to the audio data output by the voice recognition model and the corresponding relation between each text segment and the audio segment in the text data.
In one embodiment, the speech recognition model is a neural network model comprising a hidden Markov model (HMM), a convolutional neural network (CNN), and a weighted finite-state transducer (WFST).
In a second aspect, an audio data desensitization apparatus, the apparatus comprising:
the first acquisition module is used for performing voice recognition on the audio data to obtain text data corresponding to the audio data and the correspondence between each text segment in the text data and an audio segment; each audio segment is a section of audio in the audio data;
the second acquisition module is used for carrying out semantic recognition on the text data by using a preset sensitive information recognition model and acquiring a sensitive text fragment set through the semantic recognition, wherein the sensitive text fragment set consists of sensitive text fragments in the text data;
and the desensitization module is used for desensitizing the audio data according to the sensitive text segment set and the corresponding relation between each text segment and the audio segment to obtain the desensitized audio data.
In a third aspect, a computer device comprises a memory storing a computer program and a processor implementing the method steps of the audio data desensitization method described above when the computer program is executed.
In a fourth aspect, a computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the method steps of the audio data desensitization method described above.
In the solution above, the terminal performs speech recognition on the audio data to obtain corresponding text data and the correspondence between each text segment and each audio segment, where each audio segment is a section of audio in the audio data. It performs semantic recognition on the text data with a preset sensitive information recognition model to obtain a set of sensitive text segments, composed of the sensitive text segments in the text data, and desensitizes the audio data according to that set and the text-to-audio correspondence to obtain desensitized audio data. The sensitive text segment set is produced automatically by the model from the text data, and the text data itself is produced automatically by the terminal's speech recognition; every step of the voice desensitization process is therefore performed automatically, which avoids manual desensitization of the audio data and improves desensitization efficiency.
Drawings
FIG. 1 is a diagram illustrating an example of an environment in which a method for desensitizing audio data is applied in one embodiment;
FIG. 2 is a schematic flow diagram of a method for desensitizing audio data in one embodiment;
FIG. 3 is a schematic flow chart of a method for desensitizing audio data in another embodiment;
FIG. 4 is a schematic diagram of the structure of an audio data desensitizing apparatus provided in one embodiment;
FIG. 5 is a schematic diagram of the structure of an audio data desensitizing apparatus provided in another embodiment;
FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
The application provides an audio data desensitization method, apparatus, device, and storage medium, aiming to solve the problem of low audio data desensitization efficiency. The technical solutions of the present application, and how they solve the above technical problem, are described in detail below through embodiments and with reference to the drawings. The following specific embodiments may be combined with one another, and descriptions of identical or similar concepts or processes may not be repeated in some embodiments.
The audio data desensitization method provided by this embodiment can be applied in the application environment shown in FIG. 1, where the audio data desensitization terminal 102 communicates with the server 104 over a network. The terminal 102 may be, but is not limited to, a personal computer, laptop, smartphone, tablet, or portable wearable device. The server 104 may be implemented as a stand-alone server or as a cluster of multiple servers.
It should be noted that the execution body of the audio data desensitization method provided in the embodiments of the present application may be an audio data desensitization apparatus, which may be implemented as part or all of the audio data desensitization terminal through software, hardware, or a combination of the two.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments.
Fig. 2 is a flow diagram illustrating a method of desensitizing audio data in one embodiment. The embodiment relates to a specific process of automatically desensitizing audio data. As shown in fig. 2, the method comprises the steps of:
s101, performing voice recognition on the audio data to obtain text data corresponding to the audio data and corresponding relations between each text segment and the audio segment in the text data; the audio clip is a piece of audio in the audio data.
The audio data may be audio data generated when the user communicates through social software, or audio data generated when the user communicates through communication equipment, or audio data obtained by recording the audio data through recording equipment, which is not limited in the embodiment of the present application. The audio segment may be a piece of audio in the audio data, including start time information and end time information of the piece of audio in the audio data. The text data may be obtained by performing speech recognition on the audio data, where the text data may include a plurality of text segments, and each text segment may be a word in the text data, or a segment in the text data, which is not limited in this embodiment of the application. There is a one-to-one correspondence between each text segment in the text data and each audio segment in the audio data. The terminal may perform Speech Recognition on the audio data by means of Speech Recognition technology, also known as Automatic Speech Recognition (ASR), which aims at converting the vocabulary content in the audio data into computer-readable input, such as keystrokes, binary codes or character sequences.
The terminal can perform voice recognition on audio data in an ongoing communication, or retrieve audio data stored on the server and recognize that; the embodiment of the present application does not limit this. When voice recognition is performed on the audio data to obtain its corresponding text data, the correspondence between each text segment in the text data and each audio segment in the audio data is obtained at the same time. For example, for 5 s of audio data recognized as the text "today's air temperature is 25 °C", the text segment "today" may correspond to the audio between 0 s and 1 s, the connective segment (the possessive particle in the original Chinese) to the audio between 1 s and 2 s, "air temperature" to the audio between 2 s and 3 s, and "25 °C" to the audio between 3 s and 5 s.
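The correspondence described above can be sketched as a simple mapping from text segments to time spans. The segment timings below reuse the illustrative values from the 5 s example; a real ASR system would emit these boundaries itself, and the `build_correspondence` helper is hypothetical.

```python
def build_correspondence(asr_segments):
    """Map each recognized text segment to its (start_s, end_s) span."""
    return {text: (start, end) for text, start, end in asr_segments}

# Hypothetical ASR output for the 5 s example clip
asr_segments = [
    ("today", 0.0, 1.0),
    ("'s", 1.0, 2.0),
    ("air temperature", 2.0, 3.0),
    ("25 C", 3.0, 5.0),
]

correspondence = build_correspondence(asr_segments)
```

With this mapping in hand, later steps only need to look up a flagged text segment to recover the audio span to desensitize.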
S102, performing semantic recognition on the text data by using a preset sensitive information recognition model, and acquiring a sensitive text fragment set through the semantic recognition, wherein the sensitive text fragment set is composed of sensitive text fragments in the text data.
The preset sensitive information recognition model may be a model, such as a neural network model, that performs semantic recognition on the text data and determines, according to the semantics of each text segment, whether that segment is a sensitive text segment. A sensitive text segment is a text segment corresponding to sensitive information, where sensitive information may be information whose dissemination is prohibited by relevant laws and regulations, information concerning the user's privacy, or information concerning the user's safety; for example, it may be the user's bank card password, or information unsuitable for minors. The sensitive text segment set may contain one sensitive text segment, several sensitive text segments, or none, which is not limited in this embodiment of the application.
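As a rough sketch of step S102, the sensitive-information model can be stood in for by any callable that labels a text segment. Here a trivial keyword check replaces the neural model described in the patent; the terms, names, and classifier are all illustrative assumptions, not the patent's actual model.

```python
def find_sensitive_segments(text_segments, is_sensitive):
    """Return the subset of text segments flagged as sensitive."""
    return {seg for seg in text_segments if is_sensitive(seg)}

# Trivial keyword stand-in for the preset sensitive-information model
SENSITIVE_TERMS = ("password", "card number")

def keyword_model(segment):
    return any(term in segment for term in SENSITIVE_TERMS)

flagged = find_sensitive_segments(
    ["my", "bank card number", "is", "6222 0000"], keyword_model
)
```

Swapping `keyword_model` for a trained NLP classifier leaves the surrounding flow unchanged, which is why the set-of-flagged-segments interface is a convenient abstraction here.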
S103, desensitizing the audio data according to the sensitive text segment set and the corresponding relation between each text segment and the audio segment to obtain desensitized audio data.
On the basis of the above embodiment, when the sensitive text segment set and the corresponding relationship between each text segment and the audio segment are obtained, desensitization processing may be performed on the initial audio data according to the sensitive text segment set and the corresponding relationship between each text segment and the audio segment, so that no sensitive information exists in the audio data, and the desensitized audio data is obtained. The duration of the desensitized audio data may be the same as the duration of the initial audio data, or may be smaller than the duration of the initial audio data, which is not limited in this embodiment of the application.
In the audio data desensitization method above, the terminal performs speech recognition on the audio data to obtain corresponding text data and the correspondence between each text segment and each audio segment, where each audio segment is a section of audio in the audio data; performs semantic recognition on the text data with a preset sensitive information recognition model to obtain a set of sensitive text segments composed of the sensitive text segments in the text data; and desensitizes the audio data according to that set and the text-to-audio correspondence to obtain desensitized audio data. Because the sensitive text segment set is produced automatically by the model from text data that is itself produced automatically by the terminal's speech recognition, every step of the voice desensitization process is performed automatically, which avoids manual desensitization of the audio data and improves desensitization efficiency.
Optionally, desensitization processing is performed on the audio data according to each sensitive text segment included in the sensitive text segment set and the corresponding relationship between each text segment and the audio segment.
In this embodiment, after the sensitive text segment set is obtained, each sensitive text segment in the set may automatically be taken as an object of desensitization according to the correspondence between each text segment and its audio segment. Optionally, desensitizing the audio data includes deleting the audio segment corresponding to a sensitive text segment or overwriting that audio segment. That is, desensitizing the audio data according to each sensitive text segment in the set and the text-to-audio correspondence means automatically deleting or overwriting the audio segments corresponding to all the sensitive text segments in the audio data, thereby obtaining the desensitized audio data.
In this audio data desensitization method, the terminal desensitizes the audio data according to each sensitive text segment in the set and the correspondence between each text segment and its audio segment, so the desensitized audio data is obtained by directly desensitizing every sensitive text segment in the set. The terminal completes the desensitization automatically, which improves the intelligence of audio data desensitization.
Fig. 3 is a flow chart illustrating a method of desensitizing audio data according to another embodiment. The embodiment relates to a specific process of desensitizing audio data according to a sensitive text segment set and a corresponding relation between each text segment and an audio segment. As shown in fig. 3, one possible implementation method of S103 "desensitize audio data according to the sensitive text segment set and the corresponding relationship between each text segment and an audio segment" includes the following steps:
s201, receiving a sensitive text segment selection instruction input by a user.
In this embodiment, the sensitive text segment selection instruction may be a voice command, a text command, or a touch command, which is not limited in this embodiment. Correspondingly, receiving the selection instruction input by the user may mean receiving a voice command, a text command, or a touch command input by the user; the embodiment of the present application does not limit this.
S202, acquiring the selected sensitive text fragment from the sensitive text fragment set according to the instruction of the sensitive text fragment selection instruction.
On the basis of the above embodiment, the sensitive text segments in the sensitive text segment set are obtained through the preset sensitive information identification model, and when the sensitive text segments identified by the preset sensitive information identification model are inaccurate, some non-sensitive information may be deleted by mistake if desensitization processing is directly performed on the audio data according to the sensitive text segment set. Therefore, when a sensitive text segment selection instruction input by a user is received, the selected sensitive text segment can be obtained from the sensitive text segment set. That is, the sensitive text segment corresponding to the non-sensitive information is removed by screening the sensitive text segment set by the user. The terminal can select all the sensitive text segments from the sensitive text segment set according to the sensitive text segment selection instruction, can also select part of the sensitive text segments, and can also not select the sensitive text segments, which is not limited in the embodiment of the application.
S203, desensitizing the audio data according to the selected sensitive text segments and the corresponding relation between the text segments and the audio segments.
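Steps S201-S203 amount to intersecting the model's flagged set with the user's selection before desensitizing. The helper below is a minimal sketch with illustrative names; the user interaction itself (voice, text, or touch) is abstracted away into a plain list of confirmed segments.

```python
def select_segments(flagged_set, user_selected):
    """Keep only the flagged segments the user confirmed (S201-S202).

    The user may confirm all, some, or none of the flagged segments;
    unconfirmed segments are treated as false positives and kept in the audio.
    """
    return flagged_set & set(user_selected)

flagged = {"card number", "25 C"}
confirmed = select_segments(flagged, ["card number"])  # "25 C" was a false positive
```

Only the `confirmed` set is then handed to the desensitization step, which is what makes the user review improve accuracy.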
In this audio data desensitization method, the terminal receives a sensitive text segment selection instruction input by the user, obtains the selected sensitive text segments from the set according to that instruction, and desensitizes the audio data according to the selected segments and the correspondence between each text segment and its audio segment. Because sensitive text segments corresponding to non-sensitive information are removed according to the user's selection before desensitization, desensitizing the audio data according to the remaining segments is more accurate, which improves the accuracy of audio data desensitization.
Optionally, the preset sensitive information recognition model is a natural language processing (NLP) neural network model.
In this embodiment, Natural Language Processing (NLP) is a sub-field of artificial intelligence, used here to identify the semantics of text data. NLP may be implemented, for example, with a hybrid algorithm based on a bidirectional recurrent neural network (Bi-RNN) and a Conditional Random Field (CRF); of course, this application also covers implementing natural language processing with other algorithms. NLP comprises two main technical directions. Natural language understanding aims to help machines better understand human language, from basic lexical and syntactic semantics up to higher-level understanding of intent, discourse, and emotion. Natural language generation aims to help machines produce language that people can understand, such as text generation and automatic summarization. As an example of natural language understanding: when a person wants to look up a rare character whose pinyin they do not know, they can search with a description such as "what is the character made of four 又 pronounced?" Natural language processing helps the search engine understand that the user wants the pronunciation of the character composed of four "又" (叕), rather than pages that literally contain the query text.
Optionally, the audio data is input into a preset speech recognition model, and text data corresponding to the audio data output by the speech recognition model and a corresponding relationship between each text segment and the audio segment in the text data are obtained.
The preset voice recognition model may be a neural network model in which a mapping between audio data and text data is pre-stored. After audio data is input, the model outputs the corresponding text data according to that mapping, together with the correspondence between each text segment in the text data and its audio segment. Optionally, the speech recognition model is a neural network model comprising a hidden Markov model (HMM), a convolutional neural network (CNN), and a weighted finite-state transducer (WFST).
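Putting the pieces together, the whole pipeline reduces to three calls: speech recognition with timestamps, sensitive-span detection, and segment-level overwriting. Both models are passed in as callables because the patent does not fix their implementations; every interface below (the tuple format of the ASR output, the set returned by the detector) is an assumption for illustration.

```python
def desensitize_audio(samples, sample_rate, asr_model, sensitive_model):
    """End-to-end sketch: speech recognition -> detection -> overwriting.

    asr_model(samples)     -> [(text_segment, start_s, end_s), ...]  (assumed)
    sensitive_model(texts) -> set of sensitive text segments         (assumed)
    """
    segments = asr_model(samples)
    sensitive = sensitive_model([text for text, _, _ in segments])
    spans = [(start, end) for text, start, end in segments if text in sensitive]
    out = list(samples)
    for start, end in spans:
        for i in range(int(start * sample_rate), int(end * sample_rate)):
            out[i] = 0  # overwrite the sensitive span with silence
    return out

# Toy stand-ins for the two preset neural models
fake_asr = lambda s: [("hello", 0.0, 1.0), ("password", 1.0, 2.0)]
fake_detector = lambda texts: {t for t in texts if t == "password"}

clean = desensitize_audio([7] * 8, 4, fake_asr, fake_detector)
```

Because only the two callables encode model behavior, the HMM/CNN/WFST recognizer and the NLP detector can be dropped in without changing the orchestration.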
It should be understood that although the steps in the flowcharts of FIG. 2 and FIG. 3 are displayed in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, there is no strict restriction on the execution order, and the steps may be performed in other orders. Moreover, at least some of the steps in FIG. 2 or FIG. 3 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps or with sub-steps or stages of other steps.
Fig. 4 is a schematic structural diagram of an audio data desensitization apparatus provided in an embodiment, as shown in fig. 4, the audio data desensitization apparatus includes: a first acquisition module 10, a second acquisition module 20, and a desensitization module 30, wherein:
the first obtaining module 10 is configured to perform voice recognition on the audio data to obtain text data corresponding to the audio data and the correspondence between each text segment in the text data and an audio segment; each audio segment is a section of audio in the audio data;
the second obtaining module 20 is configured to perform semantic recognition on the text data by using a preset sensitive information recognition model, and obtain a sensitive text fragment set through the semantic recognition, where the sensitive text fragment set is composed of sensitive text fragments in the text data;
the desensitization module 30 is configured to perform desensitization processing on the audio data according to the sensitive text segment set and the corresponding relationship between each text segment and an audio segment, so as to obtain desensitized audio data.
The audio data desensitization device provided by the embodiment of the application can execute the method embodiment, the implementation principle and the technical effect are similar, and details are not repeated herein.
Fig. 5 is a schematic structural diagram of an audio data desensitization apparatus provided in another embodiment, and based on the embodiment shown in fig. 4, as shown in fig. 5, a desensitization module 30 includes: a receiving unit 301, a selecting unit 302 and a desensitizing unit 303, wherein:
the receiving unit 301 is configured to receive a sensitive text fragment selection instruction input by a user;
the selecting unit 302 is configured to obtain the selected sensitive text segments from the sensitive text segment set according to the indication of the sensitive text segment selection instruction;
the desensitization unit 303 is configured to perform desensitization processing on the audio data according to the selected sensitive text segments and the corresponding relationship between each text segment and the audio segment.
In an embodiment, the desensitization module 30 is specifically configured to perform desensitization processing on the audio data according to each sensitive text segment included in the sensitive text segment set and a corresponding relationship between each text segment and an audio segment.
In one embodiment, desensitizing the audio data includes deleting the audio segment corresponding to the sensitive text segment or overwriting the audio segment corresponding to the sensitive text segment.
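The two options can be sketched on raw PCM samples as follows: overwriting keeps the timeline intact, while deleting shortens the audio, so any segment times after a deletion would need shifting. The function names and the 16-bit PCM representation are assumptions.

```python
import array

def overwrite_segment(samples, sample_rate, start_s, end_s, fill=0):
    """Overwrite (mute) the audio segment mapped to a sensitive text
    segment; the clip keeps its duration, so later timestamps stay valid."""
    out = array.array('h', samples)
    for i in range(int(start_s * sample_rate),
                   min(int(end_s * sample_rate), len(out))):
        out[i] = fill
    return out

def delete_segment(samples, sample_rate, start_s, end_s):
    """Delete the segment outright; the audio becomes shorter, so segment
    times after the cut must be shifted by (end_s - start_s)."""
    lo, hi = int(start_s * sample_rate), int(end_s * sample_rate)
    return samples[:lo] + samples[hi:]

pcm = array.array('h', [1000] * 16000)  # 1 s of dummy 16 kHz 16-bit PCM
muted = overwrite_segment(pcm, 16000, 0.25, 0.5)
cut = delete_segment(pcm, 16000, 0.25, 0.5)
```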
In one embodiment, the preset sensitive information recognition model is a Natural Language Processing (NLP) neural network model.
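The NLP model itself is not specified beyond the claim language, so the sketch below substitutes a surface-pattern detector as a placeholder: it merely flags long digit runs (card, phone, or ID numbers), whereas the actual model would label segments by their semantics (the claims mention a Bi-RNN plus CRF hybrid). The regex and function name are assumptions.

```python
import re

# Placeholder for the sensitive information recognition model: flags digit
# runs of eight or more characters that look like account or phone numbers.
SENSITIVE_PATTERN = re.compile(r"\b\d[\d\s-]{6,}\d\b")

def detect_sensitive(text_segments):
    """Return the subset of recognized text segments judged sensitive."""
    return [seg for seg in text_segments if SENSITIVE_PATTERN.search(seg)]

segs = ["hello, my account is", "6222 0210 0100 1234", "thank you"]
flagged = detect_sensitive(segs)
```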
In an embodiment, the first obtaining module 10 is specifically configured to input audio data into a preset speech recognition model, to obtain text data corresponding to the audio data output by the speech recognition model, and a corresponding relationship between each text segment and an audio segment in the text data.
In one embodiment, the speech recognition model is a neural network model comprising a hidden Markov model (HMM), a convolutional neural network (CNN), and a weighted finite-state transducer (WFST).
The audio data desensitization device provided by the embodiment of the application can execute the method embodiment, the implementation principle and the technical effect are similar, and details are not repeated herein.
For specific limitations of the audio data desensitization apparatus, reference may be made to the above limitations of the audio data desensitization method, which are not repeated here. Each module in the audio data desensitization apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded, in hardware form, in or independent of a processor in the computer device, or may be stored, in software form, in a memory of the computer device, so that the processor can invoke them to perform the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal device whose internal structure may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of desensitizing audio data. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device of the computer device may be a touch layer covering the display screen, a key, trackball, or touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that the structure shown in fig. 6 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, there is provided a terminal device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
performing voice recognition on the audio data to obtain text data corresponding to the audio data and a corresponding relation between each text segment and the audio segment in the text data; the audio clip is a section of audio in the audio data;
performing semantic recognition on the text data by using a preset sensitive information recognition model, and acquiring a sensitive text fragment set through the semantic recognition, wherein the sensitive text fragment set consists of sensitive text fragments in the text data;
and desensitizing the audio data according to the sensitive text fragment set and the corresponding relation between each text fragment and the audio fragment to obtain desensitized audio data.
In one embodiment, the processor, when executing the computer program, further performs the steps of: receiving a sensitive text fragment selection instruction input by a user; acquiring the selected sensitive text fragment from the sensitive text fragment set according to the instruction of the sensitive text fragment selection instruction; and desensitizing the audio data according to the selected sensitive text segments and the corresponding relation between each text segment and the audio segment.
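The selection step above can be reduced to filtering the candidate set by whatever the user's selection instruction indicates; representing that instruction as a list of indices is an assumption for illustration.

```python
def apply_selection(sensitive_candidates, selected_indices):
    """Keep only the sensitive text segments the user selected, leaving
    the unselected candidates audible in the desensitized audio."""
    return [sensitive_candidates[i] for i in selected_indices
            if 0 <= i < len(sensitive_candidates)]

candidates = ["card 6222 0210 0100 1234", "phone 138 0000 0000", "meeting at noon"]
selected = apply_selection(candidates, [0, 1])
```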
In one embodiment, the processor, when executing the computer program, further performs the steps of: and desensitizing the audio data according to each sensitive text segment in the sensitive text segment set and the corresponding relation between each text segment and the audio segment.
In an embodiment, the desensitizing the audio data includes deleting an audio segment corresponding to the sensitive text segment or overwriting an audio segment corresponding to the sensitive text segment.
In one embodiment, the predetermined sensitive information recognition model is a natural language processing NLP neural network model.
In one embodiment, the processor, when executing the computer program, further performs the steps of: and inputting the audio data into a preset voice recognition model to obtain text data corresponding to the audio data output by the voice recognition model and the corresponding relation between each text segment and the audio segment in the text data.
In one embodiment, the speech recognition model is a neural network model comprising a hidden Markov model (HMM), a convolutional neural network (CNN), and a weighted finite-state transducer (WFST).
The implementation principle and technical effect of the terminal device provided in this embodiment are similar to those of the method embodiments described above, and are not described herein again.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
performing voice recognition on the audio data to obtain text data corresponding to the audio data and a corresponding relation between each text segment and the audio segment in the text data; the audio clip is a section of audio in the audio data;
performing semantic recognition on the text data by using a preset sensitive information recognition model, and acquiring a sensitive text fragment set through the semantic recognition, wherein the sensitive text fragment set consists of sensitive text fragments in the text data;
and desensitizing the audio data according to the sensitive text fragment set and the corresponding relation between each text fragment and the audio fragment to obtain desensitized audio data.
In one embodiment, the computer program when executed by the processor implements the steps of: receiving a sensitive text fragment selection instruction input by a user; acquiring the selected sensitive text fragment from the sensitive text fragment set according to the instruction of the sensitive text fragment selection instruction; and desensitizing the audio data according to the selected sensitive text segments and the corresponding relation between each text segment and the audio segment.
In one embodiment, the computer program when executed by the processor implements the steps of: and desensitizing the audio data according to each sensitive text segment in the sensitive text segment set and the corresponding relation between each text segment and the audio segment.
In an embodiment, the desensitizing the audio data includes deleting an audio segment corresponding to the sensitive text segment or overwriting an audio segment corresponding to the sensitive text segment.
In one embodiment, the predetermined sensitive information recognition model is a natural language processing NLP neural network model.
In one embodiment, the computer program when executed by the processor implements the steps of: and inputting the audio data into a preset voice recognition model to obtain text data corresponding to the audio data output by the voice recognition model and the corresponding relation between each text segment and the audio segment in the text data.
In one embodiment, the speech recognition model is a neural network model comprising a hidden Markov model (HMM), a convolutional neural network (CNN), and a weighted finite-state transducer (WFST).
The implementation principle and technical effect of the computer-readable storage medium provided by this embodiment are similar to those of the above-described method embodiment, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (8)

1. A method of audio data desensitization, the method comprising:
performing voice recognition on audio data to obtain text data corresponding to the audio data and a corresponding relation between each text segment and an audio segment in the text data; the audio clip is a section of audio in the audio data;
performing semantic recognition on the text data by using a preset sensitive information recognition model, and acquiring a sensitive text fragment set through the semantic recognition, wherein the sensitive text fragment set consists of sensitive text fragments in the text data; the preset sensitive information identification model is used for identifying the semantics of each text segment in the text data and determining whether each text segment in the text data is a sensitive text segment according to the semantics of each text segment in the text data;
desensitizing the audio data according to the sensitive text fragment set and the corresponding relation between each text fragment and the audio fragment to obtain desensitized audio data;
the voice recognition of the audio data to obtain text data corresponding to the audio data and a corresponding relationship between each text segment and an audio segment in the text data includes:
inputting the audio data into a preset voice recognition model to obtain text data corresponding to the audio data output by the voice recognition model and corresponding relations between text segments and audio segments in the text data; the speech recognition model is a neural network model comprising a hidden Markov model HMM, a convolutional neural network CNN, and a weighted finite-state transducer WFST.
2. The method of claim 1, wherein desensitizing the audio data according to the set of sensitive text segments and the correspondence between each of the text segments and an audio segment comprises:
receiving a sensitive text fragment selection instruction input by a user;
acquiring the selected sensitive text fragment from the sensitive text fragment set according to the indication of the sensitive text fragment selection instruction;
and desensitizing the audio data according to the selected sensitive text segments and the corresponding relation between each text segment and the audio segment.
3. The method of claim 1, wherein desensitizing the audio data according to the set of sensitive text segments and the correspondence between each of the text segments and an audio segment comprises:
and desensitizing the audio data according to each sensitive text segment in the sensitive text segment set and the corresponding relation between each text segment and the audio segment.
4. The method of any of claims 1-3, wherein desensitizing the audio data comprises deleting audio segments corresponding to sensitive text segments or overwriting audio segments corresponding to sensitive text segments.
5. The method according to any one of claims 1 to 3, wherein the preset sensitive information recognition model is a natural language processing (NLP) neural network model; the preset sensitive information recognition model implements natural language processing by using a hybrid algorithm based on a bidirectional recurrent neural network Bi-RNN and a conditional random field CRF.
6. An audio data desensitization apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for carrying out voice recognition on audio data to obtain text data corresponding to the audio data and the corresponding relation between each text segment and the audio segment in the text data; the audio clip is a section of audio in the audio data;
the second acquisition module is used for performing semantic recognition on the text data by using a preset sensitive information recognition model and acquiring a sensitive text fragment set through the semantic recognition, wherein the sensitive text fragment set consists of sensitive text fragments in the text data; the preset sensitive information identification model is used for identifying the semantics of each text segment in the text data and determining whether each text segment in the text data is a sensitive text segment according to the semantics of each text segment in the text data;
the desensitization module is used for desensitizing the audio data according to the sensitive text segment set and the corresponding relation between each text segment and the audio segment to obtain desensitized audio data;
the first obtaining module is specifically configured to input the audio data into a preset speech recognition model, so as to obtain text data corresponding to the audio data output by the speech recognition model and a corresponding relationship between each text segment and an audio segment in the text data; the speech recognition model is a neural network model comprising a hidden Markov model HMM, a convolutional neural network CNN, and a weighted finite-state transducer WFST.
7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method according to any of claims 1-5 when executing the computer program.
8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
CN201910790391.4A 2019-08-26 2019-08-26 Audio data desensitization method, device, equipment and storage medium Active CN110534113B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910790391.4A CN110534113B (en) 2019-08-26 2019-08-26 Audio data desensitization method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910790391.4A CN110534113B (en) 2019-08-26 2019-08-26 Audio data desensitization method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110534113A CN110534113A (en) 2019-12-03
CN110534113B true CN110534113B (en) 2021-08-24

Family

ID=68664215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910790391.4A Active CN110534113B (en) 2019-08-26 2019-08-26 Audio data desensitization method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110534113B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428273B (en) * 2020-04-23 2023-08-25 北京中安星云软件技术有限公司 Dynamic desensitization method and device based on machine learning
CN111835739A (en) * 2020-06-30 2020-10-27 北京小米松果电子有限公司 Video playing method and device and computer readable storage medium
CN111711562A (en) * 2020-07-16 2020-09-25 网易(杭州)网络有限公司 Message processing method and device, computer storage medium and electronic equipment
CN111883128A (en) * 2020-07-31 2020-11-03 中国工商银行股份有限公司 Voice processing method and system, and voice processing device
CN111899741A (en) * 2020-08-06 2020-11-06 上海明略人工智能(集团)有限公司 Audio keyword encryption method and device, storage medium and electronic device
CN111984175B (en) * 2020-08-14 2022-02-18 维沃移动通信有限公司 Audio information processing method and device
CN112287691B (en) * 2020-11-10 2024-02-13 深圳市天彦通信股份有限公司 Conference recording method and related equipment
CN112885371B (en) * 2021-01-13 2021-11-23 北京爱数智慧科技有限公司 Method, apparatus, electronic device and readable storage medium for audio desensitization
CN113051902A (en) * 2021-03-30 2021-06-29 上海思必驰信息科技有限公司 Voice data desensitization method, electronic device and computer-readable storage medium
CN113033191A (en) * 2021-03-30 2021-06-25 上海思必驰信息科技有限公司 Voice data processing method, electronic device and computer readable storage medium
CN113096674B (en) * 2021-03-30 2023-02-17 联想(北京)有限公司 Audio processing method and device and electronic equipment
CN113127746B (en) * 2021-05-13 2022-10-04 心动网络股份有限公司 Information pushing method based on user chat content analysis and related equipment thereof
US20220399009A1 (en) * 2021-06-09 2022-12-15 International Business Machines Corporation Protecting sensitive information in conversational exchanges
CN113840109B (en) * 2021-09-23 2022-11-08 杭州海宴科技有限公司 Classroom audio and video intelligent note taking method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682090A (en) * 2012-04-26 2012-09-19 焦点科技股份有限公司 System and method for matching and processing sensitive words on basis of polymerized word tree
CN103516915A (en) * 2012-06-27 2014-01-15 百度在线网络技术(北京)有限公司 Method, system and device for replacing sensitive words in call process of mobile terminal
CN104850574A (en) * 2015-02-15 2015-08-19 博彦科技股份有限公司 Text information oriented sensitive word filtering method
CN109800868A (en) * 2018-12-25 2019-05-24 福州瑞芯微电子股份有限公司 A kind of data encoding chip and method based on deep learning
CN110019880A (en) * 2017-09-04 2019-07-16 优酷网络技术(北京)有限公司 Video clipping method and device

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591932A (en) * 2011-12-23 2012-07-18 优视科技有限公司 Voice search method, voice search system, mobile terminal and transfer server
CN104505090B (en) * 2014-12-15 2017-11-14 北京国双科技有限公司 The audio recognition method and device of sensitive word
CN104750820A (en) * 2015-04-24 2015-07-01 中译语通科技(北京)有限公司 Filtering method and device for corpuses
US20170076626A1 (en) * 2015-09-14 2017-03-16 Seashells Education Software, Inc. System and Method for Dynamic Response to User Interaction
CN105426357A (en) * 2015-11-06 2016-03-23 武汉卡比特信息有限公司 Fast voice selection method
CN107015979B (en) * 2016-01-27 2021-04-06 斑马智行网络(香港)有限公司 Data processing method and device and intelligent terminal
CN106101819A (en) * 2016-06-21 2016-11-09 武汉斗鱼网络科技有限公司 A kind of live video sensitive content filter method based on speech recognition and device
CN107766482B (en) * 2017-10-13 2021-12-14 北京猎户星空科技有限公司 Information pushing and sending method, device, electronic equipment and storage medium
CN108305626A (en) * 2018-01-31 2018-07-20 百度在线网络技术(北京)有限公司 The sound control method and device of application program
CN109637520B (en) * 2018-10-16 2023-08-22 平安科技(深圳)有限公司 Sensitive content identification method, device, terminal and medium based on voice analysis
CN109597739A (en) * 2018-12-10 2019-04-09 苏州思必驰信息科技有限公司 Voice log services method and system in human-computer dialogue
CN109949798A (en) * 2019-01-03 2019-06-28 刘伯涵 Commercial detection method and device based on audio

Also Published As

Publication number Publication date
CN110534113A (en) 2019-12-03

Similar Documents

Publication Publication Date Title
CN110534113B (en) Audio data desensitization method, device, equipment and storage medium
US11775761B2 (en) Method and apparatus for mining entity focus in text
CN110444198B (en) Retrieval method, retrieval device, computer equipment and storage medium
CN111951805A (en) Text data processing method and device
CN110569500A (en) Text semantic recognition method and device, computer equipment and storage medium
CN111026319B (en) Intelligent text processing method and device, electronic equipment and storage medium
KR102199928B1 (en) Interactive agent apparatus and method considering user persona
EP3444811B1 (en) Speech recognition method and device
CN109858045B (en) Machine translation method and device
CN111767565A (en) Data desensitization processing method, processing device and storage medium
CN112528637A (en) Text processing model training method and device, computer equipment and storage medium
CN112634865B (en) Speech synthesis method, apparatus, computer device and storage medium
CN114155860A (en) Abstract recording method and device, computer equipment and storage medium
US20190095484A1 (en) Information processing system, electronic device, information processing method, and recording medium
CN108320740B (en) Voice recognition method and device, electronic equipment and storage medium
CN112685534B (en) Method and apparatus for generating context information of authored content during authoring process
CN108536791B (en) Searching method, equipment and storage medium neural network based
KR102177203B1 (en) Method and computer readable recording medium for detecting malware
CN113850081A (en) Text processing method, device, equipment and medium based on artificial intelligence
JPWO2017159207A1 (en) Process execution device, process execution device control method, and control program
CN116305251A (en) Network message desensitization method, device, equipment and storage medium
CN116702771A (en) Text detection method, device, equipment, medium and system
CN115983262A (en) Text sensitive information identification method and device, storage medium and electronic equipment
CN112016297B (en) Intention recognition model testing method and device, computer equipment and storage medium
CN116341561B (en) Voice sample data generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant