CN112216275A - Voice information processing method and device and electronic equipment

Info

Publication number
CN112216275A
Authority
CN
China
Prior art keywords
target
voice information
processing
voice
processed
Prior art date
Legal status
Pending
Application number
CN201910619889.4A
Other languages
Chinese (zh)
Inventor
贾锦杰
肖成志
曹凌
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority claimed from CN201910619889.4A
Publication of CN112216275A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 - Speech to text systems
    • G10L 2015/223 - Execution procedure of a spoken command
    • G10L 2015/225 - Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a voice information processing method and apparatus, an electronic device, and a computer-readable storage medium, wherein the processing method comprises the following steps: receiving a control voice input by a user; acquiring target voice information to be processed and a corresponding target processing instruction according to the control voice; and carrying out corresponding processing on the target voice information according to the target processing instruction to obtain the processed target voice information.

Description

Voice information processing method and device and electronic equipment
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a method and an apparatus for processing voice information, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of electronic technology, more and more electronic devices can provide functions such as voice control and/or voice editing.
Existing voice application software provides only very weak voice editing functions, and screenless electronic devices such as smart speakers or wireless earphones provide no voice editing function at all. A user therefore has to re-record the voice on an electronic device with a screen, such as a computer or a mobile phone, in order to edit it, or copy the voice recorded by a screenless device such as a smart speaker or wireless earphone to an electronic device with a screen, such as a computer or a mobile phone, for editing.
These approaches limit how quickly and flexibly voice information can be processed, which degrades the user experience.
Disclosure of Invention
An object of the present invention is to provide a new technical solution for processing voice information through control voice.
According to a first aspect of the present invention, there is provided a method for processing voice information, including:
receiving a control voice input by a user;
acquiring target voice information to be processed and a corresponding target processing instruction according to the control voice;
and correspondingly processing the target voice information according to the target processing instruction to obtain the processed target voice information.
Optionally, the step of obtaining the target voice information includes:
converting the control voice into a corresponding control text, and extracting attribute keywords from the control text according to a pre-constructed attribute word bank; wherein the attribute keywords at least comprise a name and/or a time;
and acquiring the target voice information according to the attribute keywords.
Optionally, the step of obtaining the target processing instruction includes:
converting the control voice into a corresponding control text, extracting instruction keywords from the control text according to a pre-constructed instruction word bank, and performing structured analysis on the instruction keywords through a structured model to obtain a processing instruction corresponding to the instruction keywords as the target processing instruction;
the structured model is a model for obtaining the processing instruction by carrying out structured organization on the collected instruction vocabulary related to the processing instruction.
Optionally, the step of performing corresponding processing on the target voice information according to the target processing instruction to obtain the processed target voice information includes:
determining a voice segment to be processed in the target voice information as a target voice segment according to the control voice;
and correspondingly processing the target voice fragment according to the target processing instruction to obtain the processed target voice information.
Optionally, the step of determining a speech segment to be processed in the target speech information according to the control speech, as a target speech segment, includes:
acquiring a first voice waveform map corresponding to the target voice information;
acquiring a second voice waveform map corresponding to the control voice, and extracting a positioning waveform map from the control voice according to a pre-constructed waveform map library;
determining, according to the positioning waveform map, a waveform segment to be processed in the first voice waveform map as the target waveform segment;
and obtaining the target voice segment according to the target waveform segment.
Optionally, the positioning waveform map includes a first positioning waveform map and a second positioning waveform map;
the step of determining the waveform segment to be processed in the first voice waveform map as the target waveform segment according to the positioning waveform map comprises:
determining a waveform segment in the first voice waveform map, which is matched with the first positioning waveform map, as a first waveform segment; determining a waveform segment in the first voice waveform map, which is matched with the second positioning waveform map, as a second waveform segment;
and taking the waveform segment between the first waveform segment and the second waveform segment as the target waveform segment.
Optionally, the step of determining a speech segment to be processed in the target speech information according to the control speech, as a target speech segment, includes:
converting the content of the control voice into a corresponding control text, and extracting time keywords from the control text according to a pre-constructed time word bank;
and determining the target voice fragment in the target voice information according to the time keyword.
Optionally, the processing manner corresponding to the target processing instruction at least includes: noise reduction processing, volume adjustment processing, mosaic processing, play speed adjustment processing, and/or deletion processing.
Optionally, the processing mode corresponding to the target processing instruction is insertion processing,
the step of performing corresponding processing on the target voice information according to the target processing instruction to obtain the processed target voice information comprises the following steps:
determining an insertion node in the target voice information according to the control voice;
responding to the operation of recording the voice again, and collecting new voice information;
and inserting the new voice information into the target voice information according to the inserting node to obtain the processed target voice information.
Optionally, the processing method further includes:
and responding to a playing request of the voice information, and playing the processed target voice information.
Optionally, the step of playing the processed target voice information in response to the request for playing the voice information includes:
responding to the playing request, and selecting a voice segment meeting the set requirement from the processed target voice information as a recommended voice segment;
and playing the recommended voice clip.
Optionally, the processing method further includes:
and storing the processed target voice information.
Optionally, after the target voice information is correspondingly processed according to the target processing instruction to obtain the processed target voice information, the processing method further includes:
and replacing the saved processed target voice information with the target voice information before processing in response to a revocation processing request input by a user.
According to a second aspect of the present invention, there is provided a processing apparatus of voice information, comprising:
the control voice receiving module is used for receiving control voice input by a user;
the information instruction acquisition module is used for acquiring target voice information to be processed and a corresponding target processing instruction according to the control voice;
and the information processing module is used for carrying out corresponding processing on the target voice information according to the target processing instruction to obtain the processed target voice information.
According to a third aspect of the present invention, there is provided an electronic apparatus comprising:
a processing apparatus according to the second aspect of the invention; or,
a processor and a memory for storing instructions for controlling the processor to perform a method of processing according to the first aspect of the invention.
According to a fourth aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the processing method according to the first aspect of the present invention.
In the embodiment of the invention, the target voice information to be processed and the corresponding target processing instruction are obtained through the control voice input by the user, and the target voice information is correspondingly processed according to the target processing instruction to obtain the processed target voice information. In this way, the target voice information can be processed through the control voice alone. Target voice information recorded by electronic equipment without a display screen, such as a smart speaker or an earphone, does not need to be copied to other electronic equipment with a display screen, such as a mobile phone or a computer, for processing, which facilitates the user's operation and improves the user experience.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a block diagram showing an example of a hardware configuration of an electronic apparatus that can be used to implement an embodiment of the present invention.
Fig. 2 shows a flowchart of a processing method of voice information of the first embodiment of the present invention.
Fig. 3 is a diagram illustrating steps of a method for processing voice information according to an embodiment of the present invention.
Fig. 4 shows a flowchart of a processing method of voice information of the second embodiment of the present invention.
Fig. 5a shows a schematic view of one example of a presentation interface according to a second embodiment of the invention.
Fig. 5b shows a schematic view of one example of a presentation interface according to a second embodiment of the present invention.
Fig. 5c shows a schematic view of one example of a presentation interface according to a second embodiment of the invention.
Fig. 6 shows a block diagram of a speech information processing apparatus of an embodiment of the present invention.
FIG. 7 shows a block diagram of one example of an electronic device of an embodiment of the invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
< hardware configuration >
Fig. 1 is a block diagram showing a hardware configuration of an electronic apparatus 1000 that can implement an embodiment of the present invention.
The electronic device 1000 may be a laptop, a desktop computer, a mobile phone, a tablet computer, a speaker, an earphone, etc. As shown in fig. 1, the electronic device 1000 may include a processor 1100, a memory 1200, an interface device 1300, a communication device 1400, a display device 1500, an input device 1600, a speaker 1700, a microphone 1800, and the like. The processor 1100 may be a central processing unit CPU, a microprocessor MCU, or the like. The memory 1200 includes, for example, a ROM (read only memory), a RAM (random access memory), a nonvolatile memory such as a hard disk, and the like. The interface device 1300 includes, for example, a USB interface, a headphone interface, and the like. The communication device 1400 is capable of wired or wireless communication, for example, and may specifically include Wifi communication, bluetooth communication, 2G/3G/4G/5G communication, and the like. The display device 1500 is, for example, a liquid crystal display panel, a touch panel, or the like. The input device 1600 may include, for example, a touch screen, a keyboard, a somatosensory input, and the like. A user can input/output voice information through the speaker 1700 and the microphone 1800.
The electronic device shown in fig. 1 is merely illustrative and is in no way meant to limit the invention, its application, or uses. In an embodiment of the present invention, the memory 1200 of the electronic device 1000 is configured to store instructions for controlling the processor 1100 to operate so as to execute any one of the processing methods of voice information provided by the embodiment of the present invention. It will be appreciated by those skilled in the art that although a plurality of means are shown for the electronic device 1000 in fig. 1, the present invention may relate to only some of the means therein, e.g. the electronic device 1000 relates to only the processor 1100 and the storage means 1200. The skilled person can design the instructions according to the disclosed solution. How the instructions control the operation of the processor is well known in the art and will not be described in detail herein.
< method examples >
< first embodiment >
In the general concept of this embodiment, a voice information processing scheme is provided, in which target voice information to be processed and a corresponding target processing instruction are obtained through control voice input by a user, and the target voice information is correspondingly processed according to the target processing instruction to obtain the processed target voice information. Voice information can therefore be processed without the participation of a display screen: the target voice information can be processed through the control voice alone, and target voice information recorded by electronic equipment without a display screen, such as a smart speaker or an earphone, does not need to be copied to other electronic equipment with a display screen, such as a mobile phone or a computer, for processing. This facilitates the user's operation and improves the user experience.
In the present embodiment, a method for processing voice information is provided. The processing method of the voice information can be implemented by the electronic equipment. The electronic device may be the electronic device 1000 as shown in fig. 1.
As shown in fig. 2, the method for processing voice information of the present embodiment may include the following steps S1000 to S3000:
step S1000, receiving control voice input by a user.
In one embodiment, a user inputs control speech through a microphone disposed on an electronic device that performs embodiments of the present invention.
The electronic device executing the embodiment of the invention may receive the control voice input by the user whenever the device is powered on, or only when its control-voice acquisition function has been activated. Specifically, the acquisition function may be triggered when the user inputs a designated wake-up voice, or when the user presses a designated button on the electronic device.
And step S2000, acquiring target voice information to be processed and a corresponding target processing instruction according to the control voice.
In one embodiment, the target voice information to be processed may be voice information recorded by itself or voice information acquired from other devices before or after performing step S1000.
Specifically, the step of acquiring the target voice information to be processed according to the control voice may include steps S2110 to S2120 as follows:
and step S2110, converting the control voice into a corresponding control text, and extracting attribute keywords from the control text according to a pre-constructed attribute word bank.
In this example, the content of the control speech may be passed through a speech recognition engine or a speech-to-text tool, plug-in, etc. to obtain the corresponding control text.
The voice information stored in the electronic device implementing the embodiment and the voice information stored in other devices each have a unique attribute, and the corresponding voice information can be uniquely determined according to the attributes. The attribute may be a name and/or a storage time. Then, the control voice input by the user may include the attribute of the voice information to be processed, so that the electronic device can accurately acquire the corresponding target voice information.
The attribute lexicon of the embodiment may include a plurality of vocabularies respectively representing attributes of different speech information. In this example, the property lexicon can be constructed in advance by mining the property vocabularies manually or by a machine.
According to the attribute word bank, similarity analysis may be performed, for example by cosine similarity, between the words obtained by segmenting the control text and the attribute words in the attribute word bank, and the attribute words whose similarity is higher than a preset similarity threshold, or the attribute word with the highest similarity, are extracted as the attribute keywords.
The attribute keywords may include a name and/or a time. The time attribute keyword in this embodiment may be a specific time, for example, 9 o'clock, or a fuzzy time, for example, before, after, or just now.
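As an illustration of the similarity-based extraction described above, the following is a minimal Python sketch. It assumes a bag-of-characters vectorization, whitespace splitting as a stand-in for real word segmentation, and a hand-built attribute word bank; the lexicon entries, the threshold, and the function names are hypothetical and not taken from the disclosure.

```python
import math

ATTRIBUTE_LEXICON = ["aaaa", "meeting record", "9 o'clock", "yesterday"]  # hypothetical entries
SIMILARITY_THRESHOLD = 0.8  # hypothetical preset threshold


def vectorize(text):
    # Bag-of-characters counts; a real system would likely use embeddings.
    vec = {}
    for ch in text:
        vec[ch] = vec.get(ch, 0) + 1
    return vec


def cosine_similarity(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def extract_attribute_keywords(control_text):
    # Compare each word of the control text against every attribute word
    # and keep the best match when it clears the preset threshold.
    keywords = []
    for word in control_text.split():  # crude segmentation stand-in
        best_score, best_attr = max(
            (cosine_similarity(vectorize(word), vectorize(attr)), attr)
            for attr in ATTRIBUTE_LEXICON
        )
        if best_score >= SIMILARITY_THRESHOLD:
            keywords.append(best_attr)
    return keywords
```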
Step S2120, target voice information is obtained according to the attribute keywords.
Specifically, one matching the attribute keyword may be selected from the voice information stored in advance in the electronic device implementing the embodiment as the target voice information. One of the voice information stored in the other electronic device that matches the attribute keyword may be acquired as the target voice information by downloading, copying, or pasting. It is also possible to record voice information as target voice information by a microphone provided on the electronic device that performs the present embodiment.
For example, if the only attribute keyword extracted in step S2110 is "after", a voice recording function may be started to record voice information as the target voice information.
For another example, if the attribute keyword extracted in step S2110 is "aaaa", then the audio information named "aaaa" may be obtained as the target voice information from other electronic devices and/or the electronic device executing the embodiment of the present invention.
In an embodiment, each time a new voice message is stored or recorded in the electronic device, the attribute of the new voice message may be updated to the attribute lexicon, so that the new voice message may be acquired for processing according to the attribute keyword of the new voice message.
In one embodiment, the step of obtaining the corresponding target processing instruction according to the control voice may include:
and converting the control voice into a corresponding control text, extracting instruction keywords from the control text according to a pre-constructed instruction word bank, and performing structured analysis on the instruction keywords through a structured model to obtain a processing instruction corresponding to the instruction keywords as the target processing instruction.
The instruction vocabulary library of the embodiment may include a plurality of instruction vocabularies respectively embodying different processing instructions. In this example, the instruction vocabulary can be mined manually or by machine to construct an instruction vocabulary library in advance.
According to the instruction word bank, similarity analysis can be performed on words obtained by word segmentation of the control text and instruction words included in the instruction word bank through methods such as cosine similarity, and the instruction words with similarity higher than a preset similarity threshold or the instruction words with highest similarity are extracted to serve as instruction keywords.
The structured model is a model for obtaining the processing instruction by carrying out structured organization on the collected vocabulary related to the processing instruction. Each instruction vocabulary included in the structured model has a corresponding processing instruction.
In this example, the instruction vocabulary obtained by manual or machine mining in advance may be associated with the processing instruction. Through the structural model, structural analysis is carried out on the instruction keywords, and processing instructions corresponding to the instruction keywords can be obtained.
In this way, the instruction keywords are extracted, through the preset instruction word bank, from the text corresponding to the control voice, and structured analysis is then performed on the instruction keywords through the structured model of this embodiment to obtain the corresponding target processing instruction. A large number of voice samples therefore does not need to be collected, and the target processing instruction embodied in the control voice is obtained quickly and effectively through a relatively simple structured analysis.
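To make the structured analysis concrete, below is a minimal sketch in which the collected instruction vocabulary is organized as a flat keyword-to-instruction mapping; every vocabulary entry and instruction name here is a hypothetical illustration, not the model actually used by the disclosure.

```python
# Hypothetical structured organization of the instruction vocabulary: each
# collected instruction word points at its corresponding processing instruction.
STRUCTURED_MODEL = {
    "denoise": "NOISE_REDUCTION",
    "louder": "VOLUME_UP",
    "quieter": "VOLUME_DOWN",
    "faster": "SPEED_UP",
    "slower": "SPEED_DOWN",
    "delete": "DELETE",
    "mosaic": "MOSAIC",
    "insert": "INSERT",
}


def structured_analysis(instruction_keywords):
    # Resolve the extracted instruction keywords to a processing instruction.
    matches = [STRUCTURED_MODEL[k] for k in instruction_keywords if k in STRUCTURED_MODEL]
    if not matches:
        raise ValueError("no processing instruction matched the control text")
    return matches[0]  # the target processing instruction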
And step S3000, performing corresponding processing on the target voice information according to the target processing instruction to obtain the processed target voice information.
The manner in which the target speech information is processed may be determined by the target processing instructions. And executing the target processing instruction aiming at the target voice information to obtain the processed target voice information.
The processing mode corresponding to the target processing instruction at least comprises the following steps: noise reduction processing, volume adjustment processing, mosaic processing, play speed adjustment processing, and/or deletion processing.
If the processing mode corresponding to the target processing instruction is noise reduction processing, the noise reduction processing may be performed on the target voice information. If the processing mode corresponding to the target processing instruction is volume increasing (or decreasing), the volume of the target voice information may be increased (or decreased). If the processing mode corresponding to the target processing instruction is to accelerate (or slow down) the playing speed, the processing may be to accelerate (or slow down) the playing speed corresponding to the target voice information. If the processing mode corresponding to the target processing instruction is deletion processing, the target voice information can be deleted. If the processing mode corresponding to the target processing instruction is mosaic processing, the target voice information can be silenced or replaced by the specified voice.
The processing mode corresponding to the target processing instruction may further include: adjusting the quality of the target voice message, and/or beautifying or filtering the sound in the target voice message.
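How such a target processing instruction might be executed can be sketched as follows, assuming the target voice information is held as a normalized NumPy sample array; each branch is a deliberately simplified stand-in for a real audio routine (for instance, the 2x speed-up by decimation also shifts pitch).

```python
import numpy as np


def apply_instruction(samples, instruction):
    # Dispatch over the processing modes listed above (simplified stand-ins).
    if instruction == "NOISE_REDUCTION":
        kernel = np.ones(5) / 5.0
        return np.convolve(samples, kernel, mode="same")  # crude smoothing
    if instruction == "VOLUME_UP":
        return np.clip(samples * 1.5, -1.0, 1.0)
    if instruction == "VOLUME_DOWN":
        return samples * 0.5
    if instruction == "SPEED_UP":
        return samples[::2]            # naive 2x speed-up by decimation
    if instruction == "MOSAIC":
        return np.zeros_like(samples)  # mute in place of the masked content
    if instruction == "DELETE":
        return samples[:0]
    raise ValueError(f"unsupported instruction: {instruction}")
```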
In one embodiment, the target voice information may be processed correspondingly according to the target processing instruction.
In another embodiment, the segments in the target voice message may be processed accordingly according to the target processing instruction.
In this example, the step of performing corresponding processing on the target speech information according to the target processing instruction to obtain the processed target speech information may include steps S3100 to S3200 as follows:
and step S3100, determining a voice segment to be processed in the target voice information as a target voice segment according to the control voice.
In one example, the step of determining the target speech segment may include steps S3111-S3114 as follows:
step S3111, a first speech waveform diagram corresponding to the target speech information is obtained.
The step of acquiring the first voice waveform map corresponding to the target voice information may include:
decompressing the target voice information and randomly dividing it into a plurality of data blocks; acquiring a sampling point and its amplitude value in each data block according to a preset sampling mode; and sorting the sampling points by time and generating the first voice waveform map according to the amplitude value of each sampling point.
In the process of acquiring the first voice waveform map corresponding to the target voice information in this embodiment, sampling points are acquired over time, and only the data of the sampling points is used to generate the first voice waveform map, which reduces the amount of computation. Furthermore, the target voice information may first be decompressed and randomly divided into a plurality of data blocks whose sizes are not fixed, and a sampling point is then acquired in each data block according to a preset sampling mode. Because the data blocks are divided randomly while the sampling within each block follows a fixed rule, the resulting sampling points combine randomness and regularity and can better represent the target voice information. After the sampling point of each data block is obtained, the sampling points are sorted according to their time in the target voice information, and the data corresponding to each sampling point, such as its amplitude value, is used to generate the first voice waveform map.
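Under the assumption that the decompressed target voice information is available as an in-memory sample array, this random-block sampling might look like the sketch below; the block count and the mid-block sampling rule are illustrative choices, not values from the disclosure.

```python
import random
import numpy as np


def build_waveform_points(samples, rate, n_blocks=64):
    # Randomly split the audio into variable-size blocks, take one sample
    # per block at a fixed in-block position, and keep the resulting
    # (time, amplitude) points in time order.
    bounds = sorted(random.sample(range(1, len(samples)), n_blocks - 1))
    blocks = np.split(np.asarray(samples, dtype=float), bounds)
    points, offset = [], 0
    for block in blocks:
        idx = offset + len(block) // 2  # fixed sampling rule inside each block
        points.append((idx / rate, float(samples[idx])))
        offset += len(block)
    return points  # block order is time order, so no extra sort is needed
```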
The speech waveform diagram of this embodiment may include information such as loudness, timbre, and frequency of the corresponding speech, for example, in the speech waveform diagram, the upper and lower amplitudes represent loudness, the combination of frequencies represents timbre, and the period interval represents frequency.
Step S3112, obtaining a second speech waveform corresponding to the control speech, and extracting a positioning waveform from the control speech according to a pre-constructed waveform library.
In this embodiment, the manner of obtaining the second speech waveform map may refer to the manner of obtaining the first speech waveform map, which is not described herein again.
The waveform library of this embodiment may include a plurality of waveform diagrams respectively representing phonetic characters or words. In this example, the waveform map library can be constructed in advance by mining the waveform map of the phonetic characters or words manually or mechanically.
According to the waveform map library, similarity analysis may be performed, for example by cosine similarity, between the waveform maps in the waveform map library and the second voice waveform map, and the part of the second voice waveform map whose similarity with a waveform map in the library is higher than a preset similarity threshold, or the part with the highest similarity, is extracted as the positioning waveform map.
One or more positioning waveform maps may be extracted in step S3112.
Step S3113, determining a waveform segment to be processed in the first speech waveform map as a target waveform segment according to the positioning waveform map.
In one example, the positioning waveform map includes a first positioning waveform map and a second positioning waveform map, and the step of determining the waveform segment to be processed in the first voice waveform map as the target waveform segment according to the positioning waveform maps includes:
determining a waveform segment in the first voice waveform map, which is matched with the first positioning waveform map, as a first waveform segment; determining a waveform segment in the first voice waveform map, which is matched with the second positioning waveform map, as a second waveform segment; and taking the waveform segment between the first waveform segment and the second waveform segment as a target waveform segment.
Specifically, the target waveform segment may include the first waveform segment and/or the second waveform segment, or may not include the first waveform segment and the second waveform segment.
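Assuming waveform maps are represented as amplitude sequences, the matching of the two positioning waveform maps and the extraction of the segment between them could be sketched as below; the normalized-correlation matcher is an illustrative choice rather than the method mandated by the disclosure.

```python
import numpy as np


def match_offset(waveform, template):
    # Slide the positioning template over the first voice waveform map and
    # return the offset with the highest normalized correlation.
    w = np.asarray(waveform, dtype=float)
    t = np.asarray(template, dtype=float)
    scores = [
        float(np.dot(w[i:i + len(t)], t)
              / (np.linalg.norm(w[i:i + len(t)]) * np.linalg.norm(t) + 1e-9))
        for i in range(len(w) - len(t) + 1)
    ]
    return int(np.argmax(scores))


def target_waveform_segment(waveform, first_loc, second_loc, include_anchors=False):
    # The target segment lies between the two matched waveform segments;
    # the matched anchors themselves may or may not be included.
    i = match_offset(waveform, first_loc)
    j = match_offset(waveform, second_loc)
    start = i if include_anchors else i + len(first_loc)
    end = j + len(second_loc) if include_anchors else j
    return waveform[start:end]
```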
And S3114, obtaining a target voice segment according to the target waveform segment.
Specifically, each sampling point in the waveform map has a corresponding time attribute, and therefore the obtained target waveform segment also has a time attribute; for example, the time attribute corresponding to the target waveform segment may be the 12th to the 13th second. According to the time attribute of the target waveform segment, the voice segment with the same time attribute in the target voice information can then be determined as the target voice segment.
In another example, the step of determining the target speech segment may include steps S3121 to S3122 as follows:
and S3121, converting the content of the control voice into a corresponding control text, and extracting time keywords from the control text according to a pre-constructed time word bank.
In this example, the content of the control speech may be passed through a speech recognition engine or a speech-to-text tool, plug-in, etc. to obtain the corresponding control text.
The time lexicon of the embodiment may include a plurality of words respectively representing different times. In this example, the time lexicon can be pre-constructed by mining these time words manually or by machine.
According to the time word bank, similarity analysis can be performed on words obtained by word segmentation of the control text and time words included in the time word bank through methods such as cosine similarity, and time words with similarity higher than a preset similarity threshold or time words with highest similarity are extracted to serve as time keywords.
One or more time keywords may be extracted in this embodiment. For example, the extracted time keywords may be "the 12th second", "later", "earlier", "to", and/or "the 15th second", etc.
And S3122, determining the target voice segment in the target voice information according to the time key words.
For example, if the time keywords extracted in step S3121 include "the 12th second", "to", and "the 15th second", the content from the 12th to the 15th second of the target voice information may be taken as the target voice segment.
For another example, if the time keywords extracted in step S3121 include "the 12th second" and "after", all the content after the 12th second of the target voice information may be taken as the target voice segment.
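A sketch of turning such time keywords into a target voice segment, assuming the keywords arrive in an already-normalized form (numeric seconds plus relation words), might be:

```python
def clip_by_time_keywords(samples, rate, keywords, total_seconds):
    # Handle patterns like ["12", "to", "15"] or ["12", "after"], where the
    # numeric entries are seconds; this parsing is a hypothetical illustration.
    times = [float(k) for k in keywords if k.replace(".", "").isdigit()]
    if "after" in keywords and len(times) == 1:
        start, end = times[0], total_seconds
    elif "to" in keywords and len(times) == 2:
        start, end = min(times), max(times)
    else:
        raise ValueError("unrecognized combination of time keywords")
    return samples[int(start * rate):int(end * rate)]
```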
Step S3200, performing corresponding processing on the target voice segment according to the target processing instruction, to obtain processed target voice information.
The processing manner of the target voice segment is similar to the foregoing processing manner of the target voice information, and is not described herein again.
In the embodiment of the invention, the target voice information to be processed and the corresponding target processing instruction are obtained through the control voice input by the user, and the target voice information is correspondingly processed according to the target processing instruction to obtain the processed target voice information. In this way, the target voice information can be processed through the control voice alone. Target voice information recorded by electronic equipment without a display screen, such as a smart speaker or an earphone, does not need to be copied to other electronic equipment with a display screen, such as a mobile phone or a computer, for processing, which facilitates the user's operation and improves the user experience.
In one embodiment, the processing method may further include:
and responding to the playing request of the voice information, and playing the processed target voice information.
In this example, in response to a request for playing the voice information, the step of playing the processed target voice information may include:
responding to the playing request, and playing the complete processed target voice information; or,
responding to the playing request, and selecting a voice segment meeting the set requirement from the processed target voice information as a recommended voice segment; and playing the recommended voice clip.
The setting requirement can be preset according to the application scene or specific requirements. For example, the setting requirement may be a specified period of time, between specified keywords, and/or a volume greater than a threshold, etc.
In one embodiment, the processing method may further include: and storing the processed target voice information.
On this basis, after step S3000 is executed, the processing method may further include:
and replacing the saved processed target voice information with the target voice information before processing in response to the revocation processing request input by the user.
In one embodiment, the processing method may further include: and in the case of receiving the control voice input by the user and/or obtaining the processed target voice, controlling the electronic equipment to vibrate to prompt the user.
< example 1>
The processing method of the voice information provided in the present embodiment will be further described below with reference to fig. 3.
As shown in fig. 3, the method for processing voice information includes: steps S3001 to S3006.
In step S3001, a control voice input by the user is received.
Step S3002, converting the control voice into a corresponding control text, extracting attribute keywords from the control text according to a pre-constructed attribute word bank, and acquiring target voice information according to the attribute keywords.
Step S3003, converting the control voice into a corresponding control text, extracting instruction keywords from the control text according to a pre-constructed instruction word bank, and performing structured analysis on the instruction keywords through a structured model to obtain a processing instruction corresponding to the instruction keywords as the target processing instruction.
Steps S3002 and S3003 may be performed simultaneously or one after the other; the order is not particularly limited here.
And step S3004, carrying out corresponding processing on the target voice information according to the target processing instruction to obtain the processed target voice information.
Step S3005, in response to the play request, selecting a voice segment meeting the setting requirement from the processed target voice information as a recommended voice segment.
Step S3006, playing the recommended voice segment.
< example 2>
On the basis of the above example 1, if the target processing instruction is an insertion instruction, then the step of performing corresponding processing on the target voice information according to the target processing instruction to obtain the processed target voice information includes:
determining an insertion node in the target voice information according to the control voice; responding to the operation of recording the voice again, and collecting new voice information; and inserting the new voice information into the target voice information according to the inserting node to obtain the processed target voice information.
The insertion node may be determined as follows: acquiring a first voice waveform map corresponding to the target voice information; acquiring a second voice waveform map corresponding to the control voice, and extracting a positioning waveform map and a direction waveform map from the control voice according to a pre-constructed waveform map library, where the direction waveform map may be a preset waveform map corresponding to words such as "before" and "after"; and obtaining the insertion node in the target voice information according to the positioning waveform map and the direction waveform map.
For example, if the word corresponding to the positioning waveform map is "my" and the word corresponding to the direction waveform map is "before", the insertion node in the target voice information may be the node before the position corresponding to the word "my".
The method for determining the insertion node may further include: converting the content of the control voice into a corresponding control text, and extracting time keywords from the control text according to a pre-constructed time word bank; and determining an insertion node in the target voice information according to the time key words.
For example, if the extracted time keywords include "the 12th second" and "after", the node after the 12th second in the target voice information may be used as the insertion node.
In this embodiment, the operation of re-recording the voice may be the acquisition of a target processing instruction whose corresponding processing mode is insertion processing, the user pressing a corresponding button on the electronic device, or the user inputting a designated wake-up voice.
And responding to the operation of re-recording the voice, acquiring new voice information through a microphone arranged on the electronic equipment, and inserting the new voice information into an insertion node in the target voice information to obtain the processed target voice information.
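Once the insertion node is known, the splice itself is straightforward; a minimal sketch assuming sample arrays and a node expressed in seconds:

```python
import numpy as np


def insert_at_node(target, new_audio, node_seconds, rate):
    # Split the target voice information at the insertion node and splice
    # the newly recorded audio between the two halves.
    node = int(node_seconds * rate)
    return np.concatenate([target[:node], new_audio, target[node:]])
```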
< example 3>
On the basis of the foregoing example 1, if the target processing instruction is a privacy processing instruction, then performing corresponding processing on the target voice information according to the target processing instruction, and obtaining processed target voice information includes:
acquiring a first voice waveform map corresponding to the target voice information; and determining the privacy information contained in the target voice information according to a pre-constructed privacy gallery, and muting the privacy information or replacing it with a specified voice to obtain the processed target voice information.
The privacy gallery in this example may include a plurality of waveform maps, each embodying a piece of privacy information. In this example, the privacy gallery may be constructed in advance by mining the waveform maps of privacy information manually or by machine.
According to the privacy gallery, similarity analysis may be performed, for example by cosine similarity, between the waveform maps in the privacy gallery and the first voice waveform map, and the parts of the first voice waveform map whose similarity with a waveform map in the privacy gallery is higher than a preset similarity threshold are taken as privacy waveform maps. The private voice information corresponding to the privacy waveform maps in the target voice information is then determined and muted, or replaced with the specified voice, to obtain the processed target voice information.
For example, the specified voice may be "beep".
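The muting/replacement step can be sketched as follows, assuming the matched privacy intervals are already available as (start, end) pairs in seconds and that the specified voice is synthesized as a plain beep tone (a hypothetical choice):

```python
import numpy as np

BEEP_FREQ_HZ = 1000.0  # hypothetical tone used as the specified voice


def mask_private_spans(samples, rate, private_spans, mute=False):
    # Mute each matched privacy interval, or overwrite it with a beep.
    out = np.array(samples, dtype=float)
    for start_s, end_s in private_spans:
        i, j = int(start_s * rate), int(end_s * rate)
        if mute:
            out[i:j] = 0.0
        else:
            t = np.arange(j - i) / rate
            out[i:j] = 0.2 * np.sin(2 * np.pi * BEEP_FREQ_HZ * t)
    return out
```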
< second embodiment >
In the present embodiment, a method for processing voice information is provided. The processing method of the voice information can be implemented by the electronic equipment. The electronic device may be any electronic device having a voice capturing function and a display function, and may be, for example, the electronic device 1000 shown in fig. 1.
As shown in fig. 4, the method for processing the voice information of the present embodiment may include the following steps S4100 to S4300:
in step S4100, a control voice input by the user is received.
In this embodiment, the step of receiving the control voice input by the user may refer to step S1000 in the first embodiment, which is not described herein again.
In one example, before performing step S4100, the processing method may further include: providing a voice input entry in the display interface, and performing the step of receiving the control voice input by the user in response to an operation of clicking the voice input entry. The voice input entry in the presentation interface may be as shown in fig. 5a to 5c.
Step S4200, obtaining target voice information to be processed and a corresponding target processing instruction according to the control voice, and displaying the target voice information and the target processing instruction in a display interface.
In this embodiment, the step of obtaining the target voice information to be processed and the corresponding target processing instruction according to the control voice may refer to step S2000 in the first embodiment, and details are not repeated here.
In one example, the target voice information displayed in the display interface as shown in fig. 5a to 5c may include a name and/or a first time axis of the target voice information, and the like.
When the user performs an operation of clicking the name of the target voice information through a touch screen, a mouse, or the like, the target voice information may be played.
When a user clicks the first time axis of the target voice message through a touch screen, a mouse, or the like, the target voice message may be played from the corresponding node according to the position of the click point in the first time axis.
In one example, a plurality of preset processing instructions may be pre-displayed in the display interface, where the plurality of preset processing instructions include a target processing instruction. The way of exposing the processing instructions may be to expose the name of each processing instruction. Then, the step of exposing the target processing instruction may comprise: and highlighting the target processing instruction.
For example, before step S4200 is executed, each of the plurality of preset processing instructions displayed in the display interface may be in a first color, and when the operation of displaying the target processing instruction is executed, the target processing instruction may be modified to a second color.
For another example, before step S4200 is executed, the names of the preset processing instructions displayed in the display interface may all be in a smaller font (for example, Chinese size-five characters), and when the target processing instruction is displayed, the font of its name may be changed to a larger one (for example, Chinese size-four characters). Reference may be made in particular to fig. 5a to 5c.
Step S4300, the target voice information is correspondingly processed according to the target processing instruction to obtain the processed target voice information, and the processed target voice information is displayed in a display interface.
In this embodiment, corresponding processing is performed on the target voice information according to the target processing instruction, and the step of obtaining the processed target voice information may refer to step S3000 in the first embodiment, which is not described herein again.
In one example, the processed target voice information presented in the presentation interface may include a name of the processed target voice information and/or a second time axis, and/or the like.
When the user performs an operation of clicking the name of the processed target voice information through a touch screen, a mouse, or the like, the processed target voice information may be played.
When the user clicks the second time axis of the processed target voice information through a touch screen, a mouse, or the like, the processed target voice information may be played from the corresponding node according to the position of the click point in the second time axis.
In an example, a play button corresponding to the target voice information and the processed target voice information may be further provided in the presentation interface, and the target voice information is played in response to an operation of clicking the play button corresponding to the target voice information, or the processed target voice information is played in response to an operation of clicking the play button corresponding to the processed target voice information.
In an embodiment in which the step of displaying the target voice information includes displaying a first time axis corresponding to the target voice information, and the step of displaying the processed target voice information includes displaying a second time axis corresponding to the processed target voice information, the display method may further include:
an overlapping portion and a difference portion between the target voice information and the processed target voice information are determined, and the overlapping portion and/or the difference portion are marked in the first time axis and/or the second time axis.
If the processing mode corresponding to the target processing instruction is deletion processing, the part needing to be deleted in the target voice information can be a difference part, and other parts can be overlapped parts; the processed target voice information is an overlapping part as a whole. The presentation effect may be that the grey part of the time axis is used to indicate the difference part, as shown in fig. 5 a.
If the processing mode corresponding to the target processing instruction is insertion processing, the part inserted in the processed target voice information can be a difference part, and the other parts can be overlapping parts; the target speech information is entirely an overlapping portion. The presentation effect may be that the grey part of the time axis is used to indicate the difference part, as shown in fig. 5 b.
In this example, the step of performing corresponding processing on the target voice information according to the target processing instruction to obtain the processed target voice information, and displaying the processed target voice information in the display interface includes: determining an insertion node in the target voice information according to the control voice; and an insertion node is marked on a first time axis; responding to the operation of recording the voice again, and collecting new voice information; displaying a third time axis corresponding to the new voice information in a display interface; and inserting the new voice information into the target voice information according to the insertion node to obtain the processed target voice information, and displaying the processed target voice information in a display interface.
Determining an insertion node in the target voice information according to the control voice; and an insertion node is marked on a first time axis; responding to the operation of recording the voice again, and collecting new voice information; according to the insertion node, the new speech information is inserted into the target speech information, and the step of obtaining the processed target speech information may refer to example 2 in the first embodiment, which is not described herein again.
The operation of re-recording the voice may be an operation of clicking a voice input entry.
The third time axis corresponding to the new voice message may have a display effect, as shown in fig. 5b, where the gray part of the time axis is used to indicate the difference part.
If the processing mode corresponding to the target processing instruction is to slow down the playing speed, the part of the target voice information, the playing speed of which needs to be adjusted, is a difference part, and the other parts can be overlapping parts; the processed target voice information has the adjusted playing speed part as the difference part, the other parts can be the overlapping parts, and the display effect can be as shown in fig. 5 c.
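For the timeline marking described above, the overlapping and difference portions can be computed from the edit itself; below is a sketch for the deletion case, where the removed spans form the difference part and the rest of the original timeline is the overlapping part.

```python
def overlap_and_difference(duration_s, removed_spans):
    # removed_spans: (start_s, end_s) deletion intervals on the original timeline.
    overlap, cursor = [], 0.0
    for start, end in sorted(removed_spans):
        if start > cursor:
            overlap.append((cursor, start))  # untouched stretch before the cut
        cursor = max(cursor, end)
    if cursor < duration_s:
        overlap.append((cursor, duration_s))
    return overlap, sorted(removed_spans)  # (overlapping part, difference part)
```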
In an embodiment of the present invention, the step of performing corresponding processing on the target voice information according to the target processing instruction to obtain the processed target voice information, and displaying the processed target voice information in the display interface may include: determining a voice segment to be processed in the target voice information as a target voice segment according to the control voice; displaying a time interval corresponding to the target voice clip in a first time axis; and correspondingly processing the target voice fragment according to the target processing instruction to obtain processed target voice information, and displaying the processed target voice information in a display interface.
Determining a voice segment to be processed in the target voice information as a target voice segment according to the control voice; the step of performing corresponding processing on the target voice segment according to the target processing instruction to obtain the processed target voice information may refer to steps S3100 to S3200 in the first embodiment, which is not described herein again.
The target voice clip may be a portion corresponding to a difference between the target voice information and the processed target voice information, and then, an effect of presenting a time interval corresponding to the target voice clip in the first time axis may be as shown in a gray scale portion in the time axis shown in fig. 5a and 5 c.
In an embodiment of the present invention, the processing method may further include: and responding to the playing request of the voice information, and playing the processed target voice information.
The voice information playing request may be triggered by the user clicking the name or time axis of the processed target voice information, triggered by a voice instruction corresponding to the processed target voice information input by the user, or triggered automatically when the processed target voice information is obtained.
In this embodiment, in response to the request for playing the voice information, the step of playing the processed target voice information may include: responding to the playing request, and selecting a voice segment meeting the set requirement from the processed target voice information as a recommended voice segment; displaying a time interval corresponding to the recommended voice clip in a second time axis; and playing the recommended voice clip.
The setting requirement can be preset according to the application scene or specific requirements. For example, the setting requirement may be a specified period of time, between specified keywords, and/or a volume greater than a threshold, etc.
In an example, the manner of displaying the time interval corresponding to the recommended voice clip in the second time axis may refer to the manner of displaying the difference portion, and is not described herein again.
For example, the time interval corresponding to the recommended voice clip may be made different from the display color of the difference part in the corresponding time axis.
< third embodiment >
In the present embodiment, a method for processing voice information is provided. The processing method of the voice information can be implemented by the terminal equipment. The terminal device can be any electronic product with a voice acquisition function, for example, an intelligent sound box, an intelligent television, a recording pen, a video camera and the like.
The method for processing the voice information of the embodiment may include:
in response to a control voice input by a user, playing the processed target voice information obtained according to the control voice.
The processed target voice information may be voice information obtained by the terminal device acquiring, according to the control voice input by the user, target voice information to be processed and a corresponding target processing instruction, and then performing corresponding processing on the target voice information according to the target processing instruction.
< apparatus embodiment >
In this embodiment, a device 6000 for processing voice information is provided, as shown in fig. 6, including a control voice receiving module 6100, an information instruction obtaining module 6200, and an information processing module 6300. The control voice receiving module 6100 is configured to receive a control voice input by a user; the information instruction obtaining module 6200 is configured to obtain target voice information to be processed and a corresponding target processing instruction according to the control voice; the information processing module 6300 is configured to perform corresponding processing on the target voice information according to the target processing instruction, so as to obtain processed target voice information.
In one embodiment, acquiring the target voice information comprises:
converting the control voice into a corresponding control text, and extracting attribute keywords from the control text according to a pre-constructed attribute word bank, wherein the attribute keywords at least comprise a name and/or a time;
and acquiring the target voice information according to the attribute keywords, as sketched below.
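A minimal, hedged Python sketch of this step follows; the contents of the attribute word bank, the recording index, and the matching rule are illustrative assumptions, since the embodiment only specifies that name and/or time keywords are extracted and used to acquire the target voice information.

```python
# Hypothetical attribute word bank (names and times); kept as an ordered list
# so the extracted keyword tuple is deterministic.
ATTRIBUTE_WORD_BANK = ["meeting", "memo", "yesterday", "this morning"]

# Hypothetical index from attribute keywords to stored recordings.
RECORDINGS = {
    ("meeting", "yesterday"): "rec_0412_meeting.wav",
}

def acquire_target_voice(control_text: str):
    """Extract attribute keywords from the control text and look up the
    target voice information; returns None when nothing matches."""
    text = control_text.lower()
    keywords = tuple(w for w in ATTRIBUTE_WORD_BANK if w in text)
    return RECORDINGS.get(keywords)

# e.g. acquire_target_voice("remove the noise in yesterday's meeting recording")
# -> "rec_0412_meeting.wav"
```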
In one embodiment, acquiring the target processing instruction comprises:
converting the control voice into a corresponding control text, extracting instruction keywords from the control text according to a pre-constructed instruction word bank, and performing structured analysis on the instruction keywords through a structured model to obtain a processing instruction corresponding to the instruction keywords as the target processing instruction;
the structured model is a model that obtains the processing instruction by structurally organizing the collected instruction vocabulary related to processing instructions.
In one embodiment, the information processing module 6300 may be further configured to:
determining a voice segment to be processed in the target voice information as a target voice segment according to the control voice;
and carrying out corresponding processing on the target voice fragment according to the target processing instruction to obtain processed target voice information.
In one embodiment, determining a voice segment to be processed in the target voice information according to the control voice includes:
acquiring a first voice waveform map corresponding to the target voice information;
acquiring a second voice waveform map corresponding to the control voice, and extracting a positioning waveform map from the control voice according to a pre-constructed waveform map library;
determining, according to the positioning waveform map, a waveform segment to be processed in the first voice waveform map as a target waveform segment;
and obtaining the target voice segment according to the target waveform segment.
In one embodiment, the positioning waveform map comprises a first positioning waveform map and a second positioning waveform map;
determining, according to the positioning waveform map, the waveform segment to be processed in the first voice waveform map as the target waveform segment comprises:
determining a waveform segment in the first voice waveform map, which is matched with the first positioning waveform map, as a first waveform segment; determining a waveform segment in the first voice waveform map, which is matched with the second positioning waveform map, as a second waveform segment;
and taking the waveform segment between the first waveform segment and the second waveform segment as a target waveform segment.
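The embodiment does not prescribe how a positioning waveform map is matched against the first voice waveform map; as one plausible, hedged realization, the Python sketch below uses cross-correlation to find each positioning waveform and takes the samples between the two matches as the target waveform segment.

```python
import numpy as np

def best_match(target: np.ndarray, pattern: np.ndarray) -> int:
    """Index in `target` where `pattern` aligns with the strongest correlation."""
    corr = np.correlate(target, pattern, mode="valid")
    return int(np.argmax(corr))

def segment_between(target: np.ndarray,
                    first_pattern: np.ndarray,
                    second_pattern: np.ndarray) -> np.ndarray:
    """Waveform segment between the first and second positioning waveforms."""
    start = best_match(target, first_pattern) + len(first_pattern)
    end = best_match(target, second_pattern)
    return target[start:end]  # empty if the second match precedes the first
```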
In one embodiment, determining a voice segment to be processed in the target voice information according to the control voice includes:
converting the content of the control voice into a corresponding control text, and extracting time keywords from the control text according to a pre-constructed time word bank;
and determining the target voice segment in the target voice information according to the time keywords.
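As a hedged illustration of this variant, the sketch below lets a regular expression stand in for the pre-constructed time word bank and parses an utterance such as "from 1:20 to 2:10" into a second-based interval; the mm:ss phrasing is an assumption of this sketch.

```python
import re

def time_segment(control_text: str):
    """Parse 'from 1:20 to 2:10' into (80, 130) seconds, or return None."""
    m = re.search(r"from (\d+):(\d{2}) to (\d+):(\d{2})", control_text)
    if not m:
        return None
    m1, s1, m2, s2 = map(int, m.groups())
    return m1 * 60 + s1, m2 * 60 + s2
```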
In one embodiment, the processing mode corresponding to the target processing instruction at least includes: noise reduction processing, volume adjustment processing, mosaic processing, play speed adjustment processing, and/or deletion processing.
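Since one target processing instruction selects one of several processing modes, a dispatch table is a natural realization; the sketch below is an assumption-laden example (the handler bodies are stubs, and the instruction names carry over from the earlier sketch), not the embodiment's prescribed design.

```python
import numpy as np

def denoise(x: np.ndarray) -> np.ndarray:
    return x  # stub; a real handler might apply spectral gating

def volume_up(x: np.ndarray) -> np.ndarray:
    return np.clip(x * 1.5, -1.0, 1.0)  # simple gain with clipping

def delete(x: np.ndarray) -> np.ndarray:
    return np.empty(0, dtype=x.dtype)  # deletion leaves an empty segment

HANDLERS = {"DENOISE": denoise, "VOLUME_UP": volume_up, "DELETE": delete}

def apply_instruction(instruction: str, segment: np.ndarray) -> np.ndarray:
    """Perform the processing mode selected by the target instruction."""
    return HANDLERS[instruction](segment)
```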
In an embodiment, the processing mode corresponding to the target processing instruction is insertion processing, and the information processing module 6300 may further be configured to:
determining an insertion node according to the control voice;
in response to an operation of re-recording voice, collecting new voice information;
and inserting the new voice information into the target voice information according to the insertion node to obtain the processed target voice information.
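A minimal sketch of this insertion processing, assuming the insertion node has already been resolved to a time offset and that both signals share one sample rate:

```python
import numpy as np

def insert_at_node(target: np.ndarray, new_audio: np.ndarray,
                   node_sec: float, sample_rate: int) -> np.ndarray:
    """Splice the newly collected audio into the target at the insertion node."""
    node = int(node_sec * sample_rate)
    return np.concatenate([target[:node], new_audio, target[node:]])
```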
In one embodiment, the processing device 6000 may further include:
a module configured to play the processed target voice information in response to a playing request for the voice information.
In one embodiment, the module configured to play the processed target voice information in response to the playing request may further be configured to:
in response to the playing request, select a voice segment meeting the set requirement from the processed target voice information as a recommended voice segment;
and play the recommended voice segment.
In one embodiment, the processing apparatus may further include:
a module configured to store the processed target voice information.
In one embodiment, the processing apparatus may further include:
a module configured to replace the saved processed target voice information with the pre-processing target voice information in response to a revocation processing request input by the user.
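A hedged sketch of this revocation behaviour follows: the store keeps a single pre-processing copy and, on an undo request, swaps it back in place of the saved processed audio. The one-level history is an assumption of this sketch.

```python
class VoiceStore:
    """Holds the saved audio plus one pre-processing copy for revocation."""

    def __init__(self):
        self.saved = None      # currently saved target voice information
        self._previous = None  # pre-processing copy kept for undo

    def save_processed(self, before, after):
        """Save the processed audio, remembering the pre-processing version."""
        self._previous, self.saved = before, after

    def undo(self):
        """Replace the saved processed audio with the pre-processing version."""
        if self._previous is not None:
            self.saved, self._previous = self._previous, None
```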
It will be clear to a person skilled in the art that the processing apparatus 6000 for voice information can be implemented in various ways. For example, it can be realized by configuring a processor with instructions: the instructions may be stored in a ROM and, when the device starts up, read from the ROM into a programmable device to implement the processing apparatus 6000. The processing apparatus 6000 can also be solidified into a dedicated device (for example, an ASIC). It can be divided into mutually independent units, or these units can be combined together. The processing apparatus 6000 may be realized by one of the above implementations, or by a combination of two or more of them.
In this embodiment, the processing apparatus 6000 for voice information may take various forms. For example, it may be a functional module running in a software product or an application program that provides a network access service, or an add-on, plug-in, or patch of such a software product or application program, or the software product or application program itself.
< electronic apparatus >
In this embodiment, an electronic device 7000 is also provided. The electronic device 7000 may be the electronic device 1000 shown in fig. 1.
In one aspect, the electronic device 7000 may comprise the aforementioned processing apparatus 6000 for voice information, which is configured to implement the method for processing voice information according to any embodiment of the present invention.
In another aspect, as shown in FIG. 7, the electronic device 7000 may also include a processor 7100 and a memory 7200, the memory 7200 being used to store executable instructions; the processor 7100 is configured to operate the electronic device 7000 under the control of the instructions to perform the method of processing voice information according to any of the embodiments of the present invention.
In this embodiment, the electronic device 7000 may be a smart speaker, an earphone, a mobile phone, a tablet computer, a palm computer, a desktop computer, a notebook computer, a workstation, a game console, or the like. For example, the electronic device 7000 may be an electronic product having a voice control function.
< computer-readable storage Medium >
In the present embodiment, there is also provided a computer-readable storage medium on which a computer program is stored, the computer program, when being executed by a processor, realizing the processing method of voice information according to any embodiment of the present invention.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (26)

1. A method for processing voice information comprises the following steps:
receiving a control voice input by a user;
acquiring target voice information to be processed and a corresponding target processing instruction according to the control voice;
and correspondingly processing the target voice information according to the target processing instruction to obtain the processed target voice information.
2. The processing method according to claim 1, wherein the step of acquiring the target voice information comprises:
converting the control voice into a corresponding control text, and extracting attribute keywords from the control text according to a pre-constructed attribute word bank; wherein the attribute keywords at least comprise a name and/or a time;
and acquiring the target voice information according to the attribute keywords.
3. The processing method of claim 1, wherein the step of fetching the target processing instruction comprises:
converting the control voice into a corresponding control text, extracting instruction keywords from the control text according to a pre-constructed instruction word bank, and performing structured analysis on the instruction keywords through a structured model to obtain a processing instruction corresponding to the instruction keywords as the target processing instruction;
the structured model is a model that obtains the processing instruction by structurally organizing the collected instruction vocabulary related to processing instructions.
4. The processing method according to claim 1, wherein the step of performing corresponding processing on the target voice information according to the target processing instruction to obtain processed target voice information comprises:
determining a voice segment to be processed in the target voice information as a target voice segment according to the control voice;
and correspondingly processing the target voice fragment according to the target processing instruction to obtain the processed target voice information.
5. The processing method according to claim 4, wherein the step of determining the voice segment to be processed in the target voice information as the target voice segment according to the control voice comprises:
acquiring a first voice waveform map corresponding to the target voice information;
acquiring a second voice waveform map corresponding to the control voice, and extracting a positioning waveform map from the control voice according to a pre-constructed waveform map library;
determining, according to the positioning waveform map, a waveform segment to be processed in the first voice waveform map as the target waveform segment;
and obtaining the target voice segment according to the target waveform segment.
6. The processing method of claim 5, wherein the positioning waveform map comprises a first positioning waveform map and a second positioning waveform map;
the step of determining the waveform segment to be processed in the first voice waveform map as the target waveform segment according to the positioning waveform map comprises:
determining a waveform segment in the first voice waveform map, which is matched with the first positioning waveform map, as a first waveform segment; determining a waveform segment in the first voice waveform map, which is matched with the second positioning waveform map, as a second waveform segment;
and taking the waveform segment between the first waveform segment and the second waveform segment as the target waveform segment.
7. The processing method according to claim 4, wherein the step of determining the voice segment to be processed in the target voice information as the target voice segment according to the control voice comprises:
converting the content of the control voice into a corresponding control text, and extracting time keywords from the control text according to a pre-constructed time word bank;
and determining the target voice fragment in the target voice information according to the time keyword.
8. The processing method according to any one of claims 1 to 7, wherein the processing manner corresponding to the target processing instruction at least includes: noise reduction processing, volume adjustment processing, mosaic processing, play speed adjustment processing, and/or deletion processing.
9. The processing method according to claim 1, wherein the processing mode corresponding to the target processing instruction is insertion processing,
the step of performing corresponding processing on the target voice information according to the target processing instruction to obtain the processed target voice information comprises the following steps:
determining an insertion node in the target voice information according to the control voice;
in response to an operation of re-recording voice, collecting new voice information;
and inserting the new voice information into the target voice information according to the insertion node to obtain the processed target voice information.
10. The processing method of claim 1, wherein the processing method further comprises:
and in response to a playing request for the voice information, playing the processed target voice information.
11. The processing method according to claim 10, wherein the step of playing the processed target voice information in response to a request for playing the voice information comprises:
responding to the playing request, and selecting a voice segment meeting the set requirement from the processed target voice information as a recommended voice segment;
and playing the recommended voice clip.
12. The processing method of claim 1, wherein the processing method further comprises:
and storing the processed target voice information.
13. The processing method according to claim 12, wherein, after the target voice information is correspondingly processed according to the target processing instruction to obtain the processed target voice information, the method further comprises:
and replacing the saved processed target voice information with the target voice information before processing in response to a revocation processing request input by a user.
14. A method for processing voice information comprises the following steps:
receiving a control voice input by a user;
acquiring target voice information to be processed and a corresponding target processing instruction according to the control voice, and displaying the target voice information and the target processing instruction in a display interface;
and correspondingly processing the target voice information according to the target processing instruction to obtain the processed target voice information, and displaying the processed target voice information in the display interface.
15. The processing method according to claim 14,
the step of displaying the target voice information comprises the following steps: displaying a first time axis corresponding to the target voice information;
the step of displaying the processed target voice information comprises the following steps: and displaying a second time axis corresponding to the processed target voice information.
16. The processing method of claim 15, wherein the processing method further comprises:
determining an overlapping portion and a difference portion between the target voice information and the processed target voice information, and indicating the overlapping portion and/or the difference portion in the first time axis and/or the second time axis.
17. The processing method according to claim 15, wherein the step of performing corresponding processing on the target voice information according to the target processing instruction to obtain processed target voice information, and displaying the processed target voice information in the display interface comprises:
determining a voice segment to be processed in the target voice information as a target voice segment according to the control voice;
displaying a time interval corresponding to the target voice clip in the first time axis;
and correspondingly processing the target voice fragment according to the target processing instruction to obtain the processed target voice information, and displaying the processed target voice information in the display interface.
18. The processing method according to claim 15, wherein the processing mode corresponding to the target processing instruction is insertion processing,
the step of performing corresponding processing on the target voice information according to the target processing instruction to obtain processed target voice information, and displaying the processed target voice information in the display interface comprises the following steps:
determining an insertion node in the target voice information according to the control voice; and identifying the insertion node in the first time axis;
in response to an operation of re-recording voice, collecting new voice information, and displaying a third time axis corresponding to the new voice information in the display interface;
and inserting the new voice information into the target voice information according to the insertion node to obtain the processed target voice information, and displaying the processed target voice information in the display interface.
19. The processing method of claim 14, wherein the processing method further comprises:
displaying a plurality of preset processing instructions in the display interface, wherein the preset processing instructions comprise the target processing instruction;
the step of presenting the target processing instruction comprises:
and highlighting the target processing instruction.
20. The processing method of claim 15, wherein the processing method further comprises:
and in response to a playing request for the voice information, playing the processed target voice information.
21. The processing method according to claim 20, wherein the step of playing the processed target voice information in response to a request for playing the voice information comprises:
responding to the playing request, and selecting a voice segment meeting the set requirement from the processed target voice information as a recommended voice segment;
displaying a time interval corresponding to the recommended voice clip in the second time axis;
and playing the recommended voice clip.
22. A processing method of voice information is implemented by a terminal device, and the method comprises the following steps:
in response to a control voice input by a user, playing the processed target voice information obtained according to the control voice.
23. The method of claim 22, wherein the terminal device is a smart speaker, a smart television, a recording pen, or a camcorder.
24. An apparatus for processing voice information, comprising:
the control voice receiving module is used for receiving control voice input by a user;
the information instruction acquisition module is used for acquiring target voice information to be processed and a corresponding target processing instruction according to the control voice;
and the information processing module is used for carrying out corresponding processing on the target voice information according to the target processing instruction to obtain the processed target voice information.
25. An electronic device, comprising:
the processing device of claim 24; alternatively, the first and second electrodes may be,
a processor and a memory for storing instructions for controlling the processor to perform a processing method according to any one of claims 1 to 23.
26. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the processing method of any one of claims 1 to 23.
CN201910619889.4A 2019-07-10 2019-07-10 Voice information processing method and device and electronic equipment Pending CN112216275A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910619889.4A CN112216275A (en) 2019-07-10 2019-07-10 Voice information processing method and device and electronic equipment


Publications (1)

Publication Number Publication Date
CN112216275A (en) 2021-01-12

Family

ID=74048056

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910619889.4A Pending CN112216275A (en) 2019-07-10 2019-07-10 Voice information processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112216275A (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0224797D0 (en) * 2002-10-25 2002-12-04 Motorola Inc Speech recognition device and method
US6604078B1 (en) * 1999-08-23 2003-08-05 Nec Corporation Voice edit device and mechanically readable recording medium in which program is recorded
US20090119107A1 (en) * 2007-11-01 2009-05-07 Microsoft Corporation Speech recognition based on symbolic representation of a target sentence
US20090281808A1 (en) * 2008-05-07 2009-11-12 Seiko Epson Corporation Voice data creation system, program, semiconductor integrated circuit device, and method for producing semiconductor integrated circuit device
JP2010078808A (en) * 2008-09-25 2010-04-08 Toshiba Corp Voice synthesis device and method
CN104933048A (en) * 2014-03-17 2015-09-23 联想(北京)有限公司 Voice message processing method and device, and electronic device
CN105740686A (en) * 2016-01-28 2016-07-06 百度在线网络技术(北京)有限公司 Application control method and device
CN107886963A (en) * 2017-11-03 2018-04-06 珠海格力电器股份有限公司 The method, apparatus and electronic equipment of a kind of speech processes
CN108337558A (en) * 2017-12-26 2018-07-27 努比亚技术有限公司 Audio and video clipping method and terminal
CN109036406A (en) * 2018-08-01 2018-12-18 深圳创维-Rgb电子有限公司 A kind of processing method of voice messaging, device, equipment and storage medium
CN109584864A (en) * 2017-09-29 2019-04-05 上海寒武纪信息科技有限公司 Image processing apparatus and method
CN109686369A (en) * 2018-12-21 2019-04-26 秒针信息技术有限公司 Audio-frequency processing method and device
US20190147050A1 (en) * 2017-11-16 2019-05-16 Baidu Online Network Technology (Beijing) Co., Ltd Method and apparatus for recommending news
WO2019095820A1 (en) * 2017-11-14 2019-05-23 优酷网络技术(北京)有限公司 Audio playing method and device
CN109859776A (en) * 2017-11-30 2019-06-07 阿里巴巴集团控股有限公司 A kind of voice edition method and device
KR20210091328A (en) * 2020-06-29 2021-07-21 바이두 온라인 네트웍 테크놀러지 (베이징) 캄파니 리미티드 Applet's voice control method, device and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHENG BINBIN; JIA JIA; CAI LIANHONG (郑彬彬; 贾珈; 蔡莲红): "A Speech Intent Understanding Method Based on Multimodal Information Fusion" (基于多模态信息融合的语音意图理解方法), China Sciencepaper Online (中国科技论文在线), no. 07 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination