CN113936697B - Voice processing method and device for voice processing - Google Patents


Info

Publication number
CN113936697B
CN113936697B (application CN202010664626.8A)
Authority
CN
China
Prior art keywords
target
voice
time
voice file
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010664626.8A
Other languages
Chinese (zh)
Other versions
CN113936697A (en)
Inventor
崔文华
李健涛
路呈璋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Intelligent Technology Co Ltd
Original Assignee
Beijing Sogou Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Intelligent Technology Co Ltd filed Critical Beijing Sogou Intelligent Technology Co Ltd
Priority to CN202010664626.8A priority Critical patent/CN113936697B/en
Publication of CN113936697A publication Critical patent/CN113936697A/en
Application granted granted Critical
Publication of CN113936697B publication Critical patent/CN113936697B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G11B 20/10527 — Audio or video recording; data buffering arrangements
    • G06F 3/0488 — Interaction techniques based on graphical user interfaces [GUI] using a touch-screen or digitiser, e.g. input of commands through traced gestures
    • G06F 40/289 — Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/30 — Semantic analysis
    • G10L 15/26 — Speech to text systems
    • G11B 2020/10546 — Audio or video recording specifically adapted for audio data
    • G11B 2020/10981 — Recording or reproducing data when the data rate or the relative speed between record carrier and transducer is variable
    • Y02D 30/70 — Reducing energy consumption in wireless communication networks


Abstract

The invention provides a speech processing method and a speech processing apparatus. While a voice file is being recorded or played, a target moment on the file's timeline is determined in response to a tapping operation on a target area of an electronic device; the target moment is marked on the timeline; and, in the voice file, at least one speech segment adjacent to the target moment is converted into target text, which is then intelligently organized. A tapping operation thus marks the target position quickly and conveniently, improving marking efficiency and reducing marking complexity. In addition, the target text, i.e. the key content, is generated automatically at the moment the user places the mark: the user need not replay the voice file around the target moment, nor transcribe the key content by hand, which greatly reduces the complexity and improves the timeliness of producing the key content.

Description

Speech processing method and apparatus, and device for speech processing
Technical Field
The present invention relates to the field of computer technology, and in particular to a speech processing method, a speech processing apparatus, and a device for speech processing.
Background
While recording or playing a voice file on an electronic device, a user often needs to mark the key content in the file so that it can be revisited later by means of the mark.
At present, when recording or playback reaches a moment of key content, the usual practice is to press a physical button on the device body or a virtual button on the screen, thereby marking that moment. Once marking is finished, the user can replay the recording from the marked moments in order to extract and write down the key content.
Disclosure of Invention
However, the inventors have found that the current scheme has drawbacks. A physical button requires the user to memorize its position in advance and operate it precisely before a mark can be placed. A virtual button requires a cumbersome interaction: the screen must first be woken before the button can be operated. Moreover, extracting the key content requires replaying the audio repeatedly, so processing efficiency is low.
In view of this, the invention provides a speech processing scheme in which a tapping operation at any position on the electronic device completes the marking quickly and conveniently, improving marking efficiency and reducing marking complexity. In addition, the target text, i.e. the key content, is generated automatically when the user marks the target moment, and a summary of the key content is intelligently organized; the user need not replay the voice file around the target moment, nor transcribe the key content by hand, which greatly reduces the complexity and improves the timeliness of producing the key content.
The invention further provides a speech processing apparatus, to ensure that the above method can be implemented and applied in practice.
An embodiment of the invention provides a speech processing method, comprising the following steps:
while a voice file is being recorded or played, responding to a tapping operation on a target area of the electronic device and determining a target moment on the timeline of the voice file, wherein the voice file comprises a plurality of speech segments;
marking the target moment on the timeline;
and, in the voice file, converting at least one speech segment adjacent to the target moment into target text, and intelligently organizing the target text.
Wherein a plurality of timestamps are set on the timeline of the voice file, and the speech segment between two adjacent timestamps forms a sentence;
converting at least one speech segment adjacent to the target moment into target text in the voice file comprises:
extracting, from the voice file, a plurality of target timestamps adjacent to the target moment;
obtaining the target speech segments between adjacent target timestamps;
and converting the target speech segments into sentence text to obtain the target text.
Wherein, after marking the target moment on the timeline, the method further comprises:
responding to a replay instruction for the target moment by starting playback of the voice file from the target moment.
Wherein intelligently organizing the target text comprises the following steps:
performing word segmentation on the target text to obtain a plurality of tokens;
matching the tokens against a preset keyword template, and displaying as a summary the target tokens that match the keywords in the template.
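As an illustrative sketch only (the patent text gives no implementation), the template-matching step above might look as follows; tokenization is assumed to have been done beforehand by a word segmenter (e.g. jieba for Chinese text), and the token and template contents here are hypothetical:

```python
def keyword_summary(tokens, keyword_template):
    """Match word-segmentation tokens against a preset keyword template
    and keep the matching tokens, in order of first appearance, as the summary."""
    template = set(keyword_template)
    seen, summary = set(), []
    for tok in tokens:
        if tok in template and tok not in seen:
            seen.add(tok)       # drop duplicate matches
            summary.append(tok)
    return summary

# Tokens as produced by a word segmenter; segmentation itself is out of scope.
tokens = ["the", "meeting", "moves", "to", "room", "302", "at", "3", "pm"]
template = ["meeting", "room", "time", "date"]
print(keyword_summary(tokens, template))  # ['meeting', 'room']
```

A production variant would likely rank or weight the matched keywords rather than simply list them, but the patent leaves that open.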
Wherein intelligently organizing the target text comprises the following steps:
performing semantic intent recognition on the target text, and displaying the recognition result as a summary.
Wherein the electronic device comprises: a display screen and a touch integrated circuit in contact with one surface of the display screen;
and responding to the tapping operation on the target area of the electronic device comprises the following steps:
sensing, through the touch integrated circuit, a tapping operation on the display-screen area of the electronic device, the electronic device responding to the tapping operation.
Wherein the tapping operation comprises: multiple consecutive taps, the time interval between two adjacent taps being less than or equal to a preset time threshold.
Wherein, after marking the target moment, the method further comprises:
generating and displaying a marking-success notification.
Wherein the target moment is the moment at which the tapping operation finishes.
An embodiment of the present invention further provides a speech processing apparatus, the apparatus comprising:
a determining module, configured to respond, while a voice file is being recorded or played, to a tapping operation on a target area of the electronic device and determine a target moment on the timeline of the voice file, wherein the voice file comprises a plurality of speech segments;
a marking module, configured to mark the target moment on the timeline;
and a transcription module, configured to convert at least one speech segment adjacent to the target moment in the voice file into target text, and to intelligently organize the target text.
Wherein a plurality of timestamps are set on the timeline of the voice file, and the speech segment between two adjacent timestamps forms a sentence; the transcription module comprises:
an extraction submodule, configured to extract, from the voice file, a plurality of target timestamps adjacent to the target moment;
an acquisition submodule, configured to obtain the target speech segments between adjacent target timestamps;
and a conversion submodule, configured to convert the target speech segments into sentence text to obtain the target text.
Wherein the apparatus further comprises:
a replay module, configured to respond to a replay instruction for the target moment and start playing the voice file from the target moment.
Wherein the transcription module comprises:
a word-segmentation submodule, configured to perform word segmentation on the target text to obtain a plurality of tokens;
and a matching submodule, configured to match the tokens against a preset keyword template and display as a summary the target tokens that match the keywords in the template.
Wherein the transcription module comprises:
a semantic-recognition submodule, configured to perform semantic intent recognition on the target text and display the recognition result as a summary.
Wherein the electronic device comprises: a display screen and a touch integrated circuit in contact with one surface of the display screen; and the determining module comprises:
a second response submodule, configured to sense, through the touch integrated circuit, a tapping operation on the display-screen area of the electronic device, the electronic device responding to the tapping operation.
Wherein the tapping operation comprises: multiple consecutive taps, the time interval between two adjacent taps being less than or equal to a preset time threshold.
Wherein the target moment is the moment at which the tapping operation finishes.
Wherein the apparatus further comprises:
a notification module, configured to generate and display a marking-success notification.
An embodiment of the present invention further provides a device for speech processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs including instructions for:
while a voice file is being recorded or played, responding to a tapping operation on a target area of the electronic device and determining a target moment on the timeline of the voice file, wherein the voice file comprises a plurality of speech segments;
marking the target moment on the timeline;
and, in the voice file, converting at least one speech segment adjacent to the target moment into target text, and intelligently organizing the target text.
An embodiment of the present invention further provides a computer-readable medium having instructions stored thereon which, when executed by one or more processors, cause a device to perform the speech processing method described above.
In the embodiment of the invention, when marking a target moment in a voice file, the user does not need to memorize the position of a physical button in advance: a tap anywhere on the electronic device completes the marking quickly and conveniently. Furthermore, no complex human-computer interaction on the screen is required. In terms of the marking operation, therefore, the embodiment of the invention improves marking efficiency and reduces marking complexity.
In addition, once the target moment has been marked, the electronic device automatically converts one or more speech segments adjacent to the target moment into target text. The target text, i.e. the key content, is thus generated automatically when the user places the mark, and a summary of the key content is intelligently organized; the user need not replay the voice file around the target moment, nor transcribe the key content by hand, which greatly reduces the complexity and improves the timeliness of producing the key content.
Drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without inventive effort.
FIG. 1 is a flowchart of the steps of a speech processing method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 3 is a schematic cross-sectional diagram of an electronic device according to an embodiment of the present invention;
FIG. 4 is a diagram of the text structure of a voice file according to an embodiment of the present invention;
FIG. 5 is a flowchart of the specific steps of a speech processing method according to an embodiment of the present invention;
FIG. 6 is an interface diagram according to an embodiment of the present invention;
FIG. 7 is a structural block diagram of a speech processing apparatus according to an embodiment of the present invention;
FIG. 8 is a block diagram of an apparatus 800 for speech processing according to an exemplary embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention is operational with numerous general-purpose or special-purpose computing environments or configurations, for example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is a flowchart of the steps of a speech processing method according to an embodiment of the present invention. As shown in Fig. 1, the method may include:
Step 101: while a voice file is being recorded or played, respond to a tapping operation on a target area of the electronic device.
Here the voice file comprises a plurality of speech segments.
In the embodiment of the present invention, the electronic device may be a recording device, a playback device, a mobile phone, a personal computer, a tablet computer, a wearable device, or the like. A microphone may be integrated in the electronic device to perform voice recording, so that a voice file is generated from the recording operation. Alternatively, the voice file may be a file downloaded from a database, another electronic device, a cloud server, or the Internet; the invention imposes no limitation on this.
If the voice file is an already recorded file, it has a timeline that reflects every moment of the audio. For example, if a voice file lasts 3 minutes 30 seconds, its timeline runs from 0 min 0 s to 3 min 30 s, and every audio frame of the file has a corresponding moment on the timeline.
If the voice file is currently being recorded, a timeline starting at 0 min 0 s is established when the recording operation starts, and time accumulates on the timeline as recording proceeds, until the recording ends.
Specifically, while recording or playing a voice file, the user may need to mark a key moment so that the important content at that moment can be retrieved later.
Optionally, the target moment is the moment at which the tapping operation finishes.
Determining a target moment on the timeline of the voice file specifically means taking, among all the moments on the timeline, the moment at which the tapping operation finishes as the target moment. For example, if, during recording, the tapping operation finishes when the timeline reads 2 min 20 s, then the moment 2 min 20 s on the timeline is taken as the target moment.
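A minimal sketch of this mapping, under the assumption that the timeline offset is simply the elapsed time since recording started; the function name and the numbers are illustrative, not from the patent:

```python
def target_moment(recording_start: float, tap_end: float) -> float:
    """Map the wall-clock time at which the tapping operation finished
    onto the voice file's timeline (seconds from 0 min 0 s)."""
    if tap_end < recording_start:
        raise ValueError("tap ended before recording started")
    return tap_end - recording_start

# Recording started at wall-clock t=100.0 s; the tap sequence ended
# at t=240.0 s, i.e. 2 min 20 s into the recording.
print(target_moment(100.0, 240.0))  # 140.0 -> 2 min 20 s on the timeline
```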
Further, the electronic device may incorporate a sensor capable of sensing the user's tapping operation on a target area of the electronic device, so that the device can respond to the tap. The target area may be any area of the electronic device.
In one implementation, referring to Fig. 2, which shows a schematic structural diagram of an electronic device according to an embodiment of the present invention, the electronic device 10 comprises: a housing 11 and an acceleration sensor 12 arranged inside the housing 11. Step 101, when implemented, may include the following step A1:
Step A1: sense a tapping operation on the housing area of the electronic device through the acceleration sensor, the electronic device responding to the tapping operation.
In this implementation, referring to Fig. 2, the target area may be the area occupied by the housing 11 of the electronic device 10. Since the acceleration sensor 12 is arranged in the internal cavity of the housing 11, it can sense the force of a tap applied to the housing 11 by the user and generate a corresponding acceleration signal, so that the electronic device 10 can respond to the tapping operation according to the acceleration signal and determine the target moment on the timeline of the voice file.
Specifically, the acceleration sensor may include a gravity gyroscope, which measures the current gravity and angular velocity of the electronic device. When the user taps the housing, the gravity gyroscope detects the changes in gravity and angular velocity, yielding a signal that triggers the electronic device to perform the subsequent operations.
In another implementation, referring to Fig. 3, which shows a schematic cross-sectional diagram of an electronic device according to an embodiment of the present invention, the electronic device 10 comprises: a display screen 13 and a touch integrated circuit 14 in contact with one surface of the display screen 13. Step 101, when implemented, may include the following step A2:
Step A2: sense a tapping operation on the display-screen area of the electronic device through the touch integrated circuit, the electronic device responding to the tapping operation.
In this implementation, referring to Fig. 3, the target area may be the area occupied by the display screen 13 of the electronic device 10, where the display screen 13 has a touch function. The touch integrated circuit 14 is arranged on one side of the display screen 13 and carries a pressure sensor, which can sense the force of a tap applied to the display screen 13 by the user and generate a corresponding pressure signal, so that the electronic device 10 can respond to the tapping operation according to the pressure signal and then determine the target moment on the timeline of the voice file. Specifically, the touch integrated circuit may be implemented as a touch chip.
Optionally, the tapping operation comprises: multiple consecutive taps, the time interval between two adjacent taps being less than or equal to a preset time threshold.
In the embodiment of the present invention, the electronic device might mistake other user operations (e.g. an ordinary click) for the tapping operation and mark a target moment at the wrong time. To reduce such misjudgments, the tapping operation may be defined to include at least two taps, with the interval between two adjacent taps less than or equal to a preset time threshold; for example, the preset time threshold may be set to 10 microseconds, which greatly reduces the probability that a normal operation is misjudged. The preset time threshold may also be set to other values according to actual requirements, which the embodiment of the present invention does not limit.
For example, the tapping operation may comprise 3 consecutive taps with the interval between two adjacent taps less than 10 microseconds. Such an operation is distinctive and easy to distinguish from other operations, reducing misjudgment by the electronic device.
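The multi-tap filter described above can be sketched as follows; note that the three-tap minimum and the 0.5 s interval here are illustrative parameter choices, not values fixed by the patent:

```python
def is_tap_command(tap_times, min_taps=3, max_interval=0.5):
    """Return True if the tap timestamps (seconds, ascending) form a
    deliberate multi-tap: at least `min_taps` taps, with every gap
    between adjacent taps no larger than `max_interval` seconds."""
    if len(tap_times) < min_taps:
        return False
    return all(b - a <= max_interval
               for a, b in zip(tap_times, tap_times[1:]))

print(is_tap_command([0.00, 0.20, 0.35]))  # True: three quick taps
print(is_tap_command([0.00, 0.20]))        # False: too few taps
print(is_tap_command([0.00, 0.20, 1.50]))  # False: second gap too long
```

Tuning `max_interval` trades off false triggers from ordinary touches against the speed at which the user must tap.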
Step 102: mark the target moment on the timeline.
In the embodiment of the present invention, after the target moment on the timeline of the voice file has been determined, it can be marked. In one implementation, a mark, which may be a timestamp, is attached to the target moment to accomplish the marking.
It should be noted that, after the target moment has been marked, the user may attach remark content to the mark, for example "the meeting venue is announced here, please take note".
Compared with marking key moments through the physical buttons of the electronic device, the embodiment of the invention requires no memorizing of button positions in advance, and the user may tap any position on the electronic device rather than operate a physical button at one fixed position. Compared with marking key moments through virtual buttons, the embodiment of the invention requires no complex human-computer interaction on the screen. In terms of the marking operation, therefore, the embodiment of the invention improves marking efficiency and reduces marking complexity.
Step 103: in the voice file, convert at least one speech segment adjacent to the target moment into target text, and intelligently organize the target text.
After the target moment has been marked, because the target moment is the moment the user considers important, an important speech segment of high value to the user exists where the voice file plays through the target moment. This segment can be converted into target text to be stored or presented to the user, so that the target text, i.e. the key content, is generated automatically when the user places the mark; the user need not replay the voice file around the target moment, nor transcribe the key content by hand, which greatly reduces the complexity and improves the timeliness of producing the key content. Moreover, because the mark is placed at that very moment, the mark of the target moment is more accurate, which in turn improves the accuracy with which the important content is extracted.
Further, after the target text has been obtained for the target moment, it may be intelligently organized, for example by automatically extracting its key points and using them as a summary. Intelligent organizing may also include automatically translating the target text into several preset languages.
It should be noted that the voice file comprises a plurality of speech segments; that is, the voice file can be divided into a plurality of speech segments according to a preset rule.
In an implementation manner, optionally, a plurality of timestamps are set on a time axis of the voice file, and a voice clip between two adjacent timestamps forms a statement; step 103, when implemented, may include the following steps B1-B3:
and B1, extracting a plurality of target time stamps adjacent to the target time from the voice file.
In this implementation, referring to fig. 4, which shows a text structure diagram of a voice file according to an embodiment of the present invention, a sentence of a user voice recorded in the voice file 20 may be divided according to a user speech speed, that is, when a sentence break of the user voice is detected (that is, a pause time between two adjacent words is greater than or equal to a preset time, for example, 0.5 second), a timestamp 21 is added to the sentence break, and after all timestamps 21 are added, a sentence is formed between two adjacent timestamps 21. In fig. 4, a total of 5 time stamps are marked, so that 5 time stamps result in 4 speech segments. The time on the time axis corresponding to the key timestamp 22 is the target time.
Further, assuming that 3 target timestamps are extracted, in one implementation, two timestamps 21 to the left and one timestamp 21 to the right of the key timestamp 22 may be extracted. In another implementation, one timestamp 21 to the left and two timestamps 21 to the right of the key timestamp 22 may be extracted. The number of the extracted target timestamps may be set according to actual requirements, which is not limited in the embodiment of the present invention.
It should be noted that, in other implementations, 3 timestamps 21 may be extracted from only the left or only the right of the key timestamp 22 as the target timestamps, which is not limited in the present invention. In addition, in other implementations, the voice file may be divided into a plurality of voice segments according to other rules; for example, the voice content may be segmented word by word, so that each word corresponds to one voice segment.
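The pause-based segmentation described above can be sketched as follows. This is a minimal illustration only, assuming word start/end times are already available from the recognizer; the function name and data layout are not part of the patent.

```python
# Hypothetical sketch of pause-based timestamping: a timestamp is added
# wherever the gap between two adjacent words meets or exceeds a preset
# pause threshold (e.g. 0.5 s), plus one at the start and end of speech.

def insert_timestamps(word_times, pause_threshold=0.5):
    """word_times: list of (start, end) tuples, one per recognized word.
    Returns the timestamps delimiting sentences; N timestamps delimit
    N - 1 speech segments, as in fig. 4 (5 timestamps -> 4 segments)."""
    if not word_times:
        return []
    stamps = [word_times[0][0]]  # timestamp at the start of speech
    for (_, prev_end), (next_start, _) in zip(word_times, word_times[1:]):
        if next_start - prev_end >= pause_threshold:
            stamps.append(prev_end)  # sentence break: add a timestamp
    stamps.append(word_times[-1][1])  # timestamp at the end of speech
    return stamps

# Example: two pauses of at least 0.5 s produce 4 timestamps (3 segments).
words = [(0.0, 0.4), (0.5, 0.9), (1.6, 2.0), (2.1, 2.4), (3.2, 3.6)]
print(insert_timestamps(words))  # [0.0, 0.9, 2.4, 3.6]
```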
Step B2: acquire the target voice segment between the adjacent target timestamps.
In this step, referring to fig. 4, in one implementation, after two timestamps to the left and one timestamp to the right of the key timestamp 22 are extracted, the target voice segment may be constructed from the content between the adjacent target timestamps; that is, the voice segments corresponding to the second sentence and the third sentence in the voice file 20 are taken as the target voice segment.
In another implementation, after extracting one timestamp to the left and two timestamps to the right of the key timestamp 22, the voice segments corresponding to the third sentence and the fourth sentence in the voice file 20 can be taken as the target voice segments.
Step B3: convert the target voice segment into sentence text to obtain the target text.
In this step, after the target voice segment is obtained, the voice content in the target voice segment may be converted into the target text by speech-to-text technology. After the target text is obtained, it can be intelligently organized and stored in the electronic device for the user to consult; in addition, the target text can be fed back to the user directly in the form of a notification, which improves the timeliness with which the user consults the important content.
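Steps B1-B3 can be sketched as below, under the assumption that the voice file is represented as a sorted list of timestamps plus one already-transcribed sentence text per segment (the speech-to-text step is stubbed out); the exact neighborhood window is configurable, since the description leaves the number of extracted target timestamps to actual requirements.

```python
# Illustrative sketch of steps B1-B3: locate the key timestamp matching
# the target time, take the speech segments adjacent to it on either
# side, and join their sentence texts into the target text.

import bisect

def extract_target_text(timestamps, sentences, target_time,
                        before=1, after=1):
    """timestamps: sorted times; sentences[i] is the text of the segment
    between timestamps[i] and timestamps[i + 1]."""
    key = bisect.bisect_left(timestamps, target_time)   # B1: key timestamp
    lo = max(0, key - before)               # segments left of the key timestamp
    hi = min(len(sentences), key + after)   # segments right of it
    return " ".join(sentences[lo:hi])       # B2 + B3: joined target text

stamps = [0.0, 2.0, 4.0, 6.0, 8.0]  # 5 timestamps -> 4 segments, as in fig. 4
texts = ["first", "second", "third", "fourth"]
# Key timestamp at 4.0 (the middle one): one segment on each side yields
# the second and third sentences, matching the description of fig. 4.
print(extract_target_text(stamps, texts, 4.0))  # second third
```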
In summary, when marking a target time in a voice file, the user does not need to memorize the positions of the physical keys of the electronic device in advance and can complete the marking quickly and conveniently by tapping any position of the electronic device. In addition, the embodiment of the present invention requires no complex human-computer interaction on the screen, so in terms of the marking operation, the embodiment improves marking efficiency and reduces marking complexity.
In addition, after the target time is marked, the electronic device can automatically convert one or more voice segments adjacent to the target time into the target text. The target text, serving as the important content, is thus generated automatically at the moment the user marks the target time and can be intelligently organized. The user neither needs to replay the voice file at the target time nor manually record the important content, which greatly reduces the complexity of generating the important content and improves its timeliness.
Fig. 5 is a flowchart illustrating specific steps of a speech processing method according to an embodiment of the present invention, and as shown in fig. 5, the method may include:
Step 201: in the case of recording or playing a voice file, in response to a tapping operation on the electronic device, determine a target time in the time axis of the voice file.
The target time is the moment at which the execution of the tapping operation finishes, and the voice file includes a plurality of voice segments.
This step may specifically refer to step 101, which is not described herein again.
Step 202, marking the target time in the time axis.
This step may refer to step 102, which is not described herein again.
Step 203: in the voice file, convert at least one voice segment adjacent to the target time into a target text, and intelligently organize the target text.
This step may specifically refer to step 103, which is not described herein again.
Optionally, in an implementation manner, step 203 may specifically include:
substep 2031, performing word segmentation processing on the target text to obtain a plurality of word segments.
In the embodiment of the present invention, word segmentation is an important processing means in natural language processing technology: it is the process of recombining a continuous character sequence into a word sequence according to certain specifications. Methods for word segmentation include character matching, understanding-based methods, and statistical methods.
Substep 2032, matching the multiple participles with a preset keyword template, and displaying the target participles matched with the keywords in the keyword template as a summary.
In the embodiment of the invention, in order to refine important contents in the target text, a keyword template comprising a plurality of keywords can be established according to actual requirements, and after the target text is obtained, the target text is decomposed into a plurality of word segments so as to be convenient for processing the target text.
In this step, by matching the multiple participles with the preset keyword template, one or more target participles can be selected from the multiple participles. The target participles are the words that match the keywords in the keyword template; displaying them as the summary achieves the purpose of intelligently organizing the target text into a summary and improves the user experience.
For example, in a scenario where administrative staff prepare for a conference, the keyword template may include keyword types such as place, time, and meeting. Assuming the target text is "Hold a meeting in conference room No. 2 at 13:00 today", the target participles obtained from the target text include "13:00", "conference room No. 2", and "hold a meeting". Displaying these target participles as the summary effectively helps administrative staff absorb the important content quickly and improves work efficiency.
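Sub-steps 2031-2032 can be sketched as follows. This is a minimal illustration: the whitespace split stands in for a real Chinese word segmenter (character matching / statistical methods), and the template contents are invented for the example.

```python
# Sketch of word segmentation (sub-step 2031) followed by keyword-template
# matching (sub-step 2032); participles that match a template keyword are
# kept, in order and without duplicates, as the summary.

def summarize(target_text, keyword_template):
    # Stand-in segmentation: strip simple punctuation, split on whitespace.
    participles = target_text.replace(",", " ").replace(".", " ").split()
    seen, summary = set(), []
    for word in participles:
        if word in keyword_template and word not in seen:
            seen.add(word)
            summary.append(word)  # target participle: matched a keyword
    return summary

# Illustrative template covering place / time / meeting keyword types.
template = {"13:00", "conference-room-2", "meeting"}
text = "hold a meeting in conference-room-2 at 13:00 today."
print(summarize(text, template))  # ['meeting', 'conference-room-2', '13:00']
```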
Optionally, in another implementation manner, step 203 may specifically include:
substep 2033, performing semantic intention recognition on the target text, and displaying the recognition result as a summary.
In another implementation, the target text can be processed by a semantic recognition model, which outputs the semantic intention corresponding to the target text as the recognition result; the key points of the target text are thereby extracted quickly and automatically, realizing intelligent organization of the target text. The semantic recognition model may be implemented based on a deep learning model, such as a convolutional neural network model.
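The patent proposes a deep-learning semantic recognition model; as a stand-in with the same interface (text in, intent label out), the sketch below uses simple cue-phrase rules instead of a trained network. All intent names and cue phrases are illustrative assumptions, not part of the patent.

```python
# Rule-based stand-in for the semantic recognition model of sub-step 2033:
# maps the target text to a semantic intention, which would be displayed
# as the summary. A real implementation would use a trained classifier.

INTENT_CUES = {
    "schedule_meeting": ["meeting", "conference room"],
    "set_deadline": ["due", "deadline", "submit by"],
}

def recognize_intent(target_text):
    text = target_text.lower()
    for intent, cues in INTENT_CUES.items():
        if any(cue in text for cue in cues):
            return intent  # recognition result, shown as the summary
    return "unknown"

print(recognize_intent("Hold a meeting in conference room No. 2 at 13:00"))
# schedule_meeting
```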
Optionally, after step 202, the method may further include:
Step 204: in response to a replay instruction for the target time, start playing the voice file from the target time.
In the embodiment of the present invention, after the marking of a target time is completed, if the user needs to listen again to the content at that target time in the voice file, the user may perform a replay operation on the electronic device for the target time. For example, if there are multiple target times, the user may select one of them and click its replay button, so that the electronic device responds to the replay instruction for that target time: if the voice file is not open, the electronic device opens it and starts playing from the target time; if the voice file is already playing, it plays the voice file again from the target time.
Optionally, after step 202, the method may further include:
Step 205: generate and display a mark-success notification.
In the embodiment of the invention, after a target moment is marked, the electronic equipment can generate a mark success notice and display the notice so as to inform a user that the target moment is marked.
For example, referring to fig. 6, which shows an interface diagram provided by the embodiment of the present invention: in the voice file recording interface 30 of the electronic device, if the user completes a mark by a tapping operation at 4 minutes 25 seconds, a mark-success notification may be presented in the notification bar 31 of the voice file recording interface 30; the notification may include a mark-success icon and the text "one mark has been generated".
In summary, when marking a target time in a voice file, the user does not need to memorize the positions of the physical keys of the electronic device in advance and can complete the marking quickly and conveniently by tapping any position of the electronic device. In addition, the embodiment of the present invention requires no complex human-computer interaction on the screen, so in terms of the marking operation, the embodiment improves marking efficiency and reduces marking complexity. After the marking is completed, the user can quickly play the voice file from the target time through a replay instruction for the target time, which improves the speed at which the user consults the important content.
In addition, after the marking of the target time is completed, the electronic device can automatically convert one or more voice segments adjacent to the target time into the target text. The target text, serving as the important content, is thus generated automatically at the moment the user marks the target time, and a summary of the important content can be intelligently organized. The user neither needs to replay the voice file at the target time nor manually record the important content, which greatly reduces the complexity of generating the important content and improves its timeliness.
For simplicity of explanation, the foregoing method embodiments are presented as a series of interrelated acts, but it should be appreciated by those skilled in the art that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the present invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Corresponding to the method provided by the foregoing embodiment of the speech processing method of the present invention, referring to fig. 7, the present invention further provides an embodiment of a speech processing apparatus, and in this embodiment, the apparatus may include:
a determining module 301, configured to determine a target time in a time axis of a voice file in response to a tapping operation on a target area in the electronic device when the voice file is recorded or played, where the voice file includes a plurality of voice clips;
a marking module 302, configured to mark the target time in the time axis;
and the transcription module 303 is configured to convert at least one voice segment adjacent to the target time into a target text in the voice file, and intelligently arrange the target text.
Wherein the transfer module 303 comprises:
the word segmentation sub-module is used for carrying out word segmentation processing on the target text to obtain a plurality of words;
and the matching sub-module is used for matching the multiple participles with a preset keyword template and taking the target participles matched with the keywords in the keyword template as summary display.
Wherein, the transferring module 303 comprises:
and the semantic recognition submodule is used for carrying out semantic intention recognition on the target text and displaying the recognition result as a summary.
A plurality of timestamps are arranged on the time axis of the voice file, and voice clips between two adjacent timestamps form a sentence; the transfer module comprises:
the extraction submodule is used for extracting a plurality of target timestamps adjacent to the target time from the voice file;
the acquisition submodule is used for acquiring a target voice segment between the adjacent target timestamps;
and the conversion sub-module is used for converting the target voice fragment into a sentence text to obtain the target text.
Wherein the apparatus further comprises:
The replay module is configured to respond to a replay instruction for the target time and start playing the voice file from the target time.
Wherein the electronic device comprises: the acceleration sensor comprises a shell and an acceleration sensor arranged in the shell; the determining module 301 includes:
The first response submodule is configured to sense, through the acceleration sensor, a tapping operation on the housing area of the electronic device, and the electronic device responds to the tapping operation.
Wherein the electronic device comprises: the touch control integrated circuit comprises a display screen and a touch control integrated circuit which is contacted with one surface of the display screen; the determining module 301 includes:
The second response submodule is configured to sense, through the touch integrated circuit, a tapping operation on the display screen area of the electronic device, and the electronic device responds to the tapping operation.
Wherein the tapping operation comprises: multiple consecutive taps, where the time interval between two adjacent taps is less than or equal to a preset time threshold.
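The tapping criterion can be sketched as follows. This is an assumption-laden illustration: tap event times would in practice come from the acceleration sensor or the touch integrated circuit, and the tap count and threshold values here are examples only.

```python
# Sketch of the consecutive-tap criterion: a mark is triggered by
# `required_taps` taps whose adjacent intervals are each no greater than
# a preset threshold; the target time is when the last tap finishes.

def detect_mark(tap_times, required_taps=2, max_interval=0.6):
    """tap_times: ascending timestamps (seconds) of detected taps.
    Returns the target time, or None if no valid tap sequence exists."""
    if len(tap_times) < required_taps:
        return None
    window = tap_times[-required_taps:]
    for earlier, later in zip(window, window[1:]):
        if later - earlier > max_interval:
            return None  # taps too far apart: not a continuous sequence
    return window[-1]  # target time: moment the tapping operation ends

print(detect_mark([10.0, 10.4]))  # 10.4  (double tap within 0.6 s)
print(detect_mark([10.0, 11.5]))  # None  (interval exceeds threshold)
```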
Wherein the apparatus further comprises:
and the notification module is used for generating and displaying the mark success notification.
The target time is the moment at which the execution of the tapping operation finishes.
In summary, when marking a target time in a voice file, the user does not need to memorize the positions of the physical keys of the electronic device in advance and can complete the marking quickly and conveniently by tapping any position of the electronic device. In addition, the embodiment of the present invention requires no complex human-computer interaction on the screen, so in terms of the marking operation, the embodiment improves marking efficiency and reduces marking complexity. After the marking is completed, the user can quickly play the voice file from the target time through a replay instruction for the target time, which improves the speed at which the user consults the important content.
In addition, after the marking of the target time is completed, the electronic device can automatically convert one or more voice segments adjacent to the target time into the target text. The target text, serving as the important content, is thus generated automatically at the moment the user marks the target time and can be intelligently organized. The user neither needs to replay the voice file at the target time nor manually record the important content, which greatly reduces the complexity of generating the important content and improves its timeliness.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 8 is a block diagram illustrating a speech processing apparatus 800 according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 8, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing element 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the apparatus 800. For example, the sensor assembly 814 may detect the open/closed state of the apparatus 800 and the relative positioning of components, such as the display and keypad of the apparatus 800. The sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the apparatus 800 and other devices. The apparatus 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a method of speech processing, the method comprising: in the case of recording or playing a voice file, in response to a tapping operation on a target area in the electronic equipment, determining a target moment in a time axis of the voice file, wherein the voice file comprises a plurality of voice fragments; marking the target time in the time axis; and in the voice file, converting at least one voice fragment adjacent to the target moment into a target text, and intelligently sorting the target text.
A plurality of timestamps are arranged on the time axis of the voice file, and voice clips between two adjacent timestamps form a sentence;
in the voice file, converting at least one voice segment adjacent to the target time into a target text, including:
extracting a plurality of target timestamps adjacent to the target time from the voice file;
acquiring a target voice fragment between adjacent target timestamps;
and converting the target voice fragment into a sentence text to obtain the target text.
Wherein after said marking the target time instant in the timeline, the method further comprises:
and in response to a replay instruction for the target time, starting to play the voice file from the target time.
Wherein, the intelligent arrangement of the target text comprises the following steps:
performing word segmentation processing on the target text to obtain a plurality of words;
matching the multiple participles with a preset keyword template, and displaying target participles matched with the keywords in the keyword template as summaries.
Wherein, the intelligent arrangement of the target text comprises the following steps:
and performing semantic intention recognition on the target text, and displaying a recognition result as a summary.
Wherein the electronic device comprises: the device comprises a shell and an acceleration sensor arranged in the shell;
The responding to the tapping operation on the target area in the electronic device includes:
sensing, through the acceleration sensor, a tapping operation on the housing area of the electronic device, and responding, by the electronic device, to the tapping operation.
Wherein the electronic device comprises: the touch control integrated circuit comprises a display screen and a touch control integrated circuit which is contacted with one surface of the display screen;
The responding to the tapping operation on the target area in the electronic device includes:
sensing, through the touch integrated circuit, a tapping operation on the display screen area of the electronic device, and responding, by the electronic device, to the tapping operation.
Wherein the tapping operation comprises: multiple consecutive taps, where the time interval between two adjacent taps is less than or equal to a preset time threshold.
Wherein after said marking the target time instant, the method further comprises:
and generating and displaying a mark success notice.
The target time is the moment at which the execution of the tapping operation finishes.
Fig. 9 is a schematic structural diagram of a server in the embodiment of the present invention. The server 1900 may vary widely by configuration or performance and may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. The invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (8)

1. A speech processing method applied to electronic equipment is characterized by comprising the following steps:
in the case of recording or playing a voice file, in response to a tapping operation on a target area in the electronic equipment, determining a target moment in a time axis of the voice file, wherein the voice file comprises a plurality of voice fragments; the target area is any area of the electronic equipment; the target time is the time when the execution of the tapping operation is finished;
marking the target time in the time axis;
in the voice file, converting at least one voice fragment adjacent to the target moment into a target text, and intelligently sorting the target text;
a plurality of timestamps are arranged on the time axis of the voice file, and voice clips between two adjacent timestamps form a sentence; in the voice file, converting at least one voice segment adjacent to the target time into a target text, including:
extracting a plurality of target time stamps adjacent to the target time from the voice file;
acquiring a target voice fragment between adjacent target timestamps;
converting the target voice fragment into a sentence text to obtain the target text;
the intelligent arrangement of the target text comprises the following steps:
performing word segmentation processing on the target text to obtain a plurality of words;
matching the multiple participles with a preset keyword template, and displaying target participles matched with the keywords in the keyword template as summaries.
2. The method of claim 1, wherein after said marking said target time in said timeline, said method further comprises:
responding to a replay instruction for the target moment, and starting to play the voice file from the target moment.
3. The method of claim 1, wherein the intelligently organizing the target text comprises:
and performing semantic intention recognition on the target text, and displaying a recognition result as a summary.
4. The method of claim 1, wherein the electronic device comprises: the touch control integrated circuit comprises a display screen and a touch control integrated circuit which is in contact with one surface of the display screen;
the responding to the knocking operation of the target area in the electronic equipment comprises the following steps:
sensing the tapping operation on the display screen area of the electronic equipment through the touch integrated circuit, and responding to the tapping operation by the electronic equipment.
5. The method according to any one of claims 1 to 4, wherein the tapping operation comprises: multiple consecutive taps, wherein the time interval between two adjacent taps is less than or equal to a preset time threshold.
6. A speech processing device applied to electronic equipment is characterized by comprising:
the determining module is used for responding to a tapping operation on a target area in the electronic equipment under the condition of recording or playing a voice file, and determining a target moment in a time axis of the voice file, wherein the voice file comprises a plurality of voice clips; the target area is any area of the electronic equipment; the target time is the time when the execution of the tapping operation is finished;
the marking module is used for marking the target time in the time axis;
the transcription module is used for converting at least one voice fragment adjacent to the target moment into a target text in the voice file and intelligently sorting the target text;
a plurality of timestamps are arranged on the time axis of the voice file, and voice clips between two adjacent timestamps form a sentence; the transcription module is also used for extracting a plurality of target timestamps adjacent to the target time from the voice file; acquiring a target voice fragment between adjacent target timestamps; converting the target voice fragment into a sentence text to obtain the target text; performing word segmentation processing on the target text to obtain a plurality of words; matching the multiple participles with a preset keyword template, and displaying target participles matched with the keywords in the keyword template as summaries.
7. An apparatus for speech processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
while recording or playing a voice file, determining a target time in a time axis of the voice file in response to a tapping operation on a target area of an electronic device, wherein the voice file comprises a plurality of voice segments; the target area is any area of the electronic device; and the target time is the time at which execution of the tapping operation is completed;
marking the target time in the time axis; and
converting, in the voice file, at least one voice segment adjacent to the target time into a target text, and intelligently organizing the target text;
wherein a plurality of timestamps are arranged on the time axis of the voice file, and the voice segment between two adjacent timestamps forms a sentence; the converting, in the voice file, at least one voice segment adjacent to the target time into a target text comprises:
extracting a plurality of target timestamps adjacent to the target time from the voice file;
acquiring the target voice segments between adjacent target timestamps; and
converting the target voice segments into sentence texts to obtain the target text;
and the intelligently organizing the target text comprises:
performing word segmentation on the target text to obtain a plurality of word segments; and
matching the plurality of word segments against a preset keyword template, and displaying the target word segments that match keywords in the keyword template as a summary.
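The transcription steps of claim 7 can be sketched as follows: locate the timestamps adjacent to the marked target time, take the voice segments between adjacent timestamps (each segment corresponding to one sentence), and join their transcripts into the target text. The bisect lookup, the one-sentence window on each side, and the sample data are assumptions for illustration, not the patent's prescribed method.

```python
import bisect

def target_text(timestamps, transcripts, target_time, window=1):
    """Join the transcripts of the sentences adjacent to target_time.

    transcripts[i] is the sentence between timestamps[i] and timestamps[i+1].
    """
    # Index of the segment whose interval contains target_time.
    i = bisect.bisect_right(timestamps, target_time) - 1
    # Take `window` sentences on each side of that segment (assumed width).
    lo = max(0, i - window)
    hi = min(len(transcripts), i + window + 1)
    return " ".join(transcripts[lo:hi])

# Illustrative timestamps (seconds) and per-sentence transcripts.
stamps = [0.0, 4.2, 9.5, 15.1, 20.0]
texts = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
print(target_text(stamps, texts, target_time=10.0))
# -> Second sentence. Third sentence. Fourth sentence.
```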
8. A computer-readable medium having stored thereon instructions which, when executed by one or more processors, cause an apparatus to perform the speech processing method according to any one of claims 1 to 5.
CN202010664626.8A 2020-07-10 2020-07-10 Voice processing method and device for voice processing Active CN113936697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010664626.8A CN113936697B (en) 2020-07-10 2020-07-10 Voice processing method and device for voice processing


Publications (2)

Publication Number Publication Date
CN113936697A CN113936697A (en) 2022-01-14
CN113936697B true CN113936697B (en) 2023-04-18

Family

ID=79273354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010664626.8A Active CN113936697B (en) 2020-07-10 2020-07-10 Voice processing method and device for voice processing

Country Status (1)

Country Link
CN (1) CN113936697B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935850A (en) * 2022-03-31 2023-10-24 华为技术有限公司 Data processing method and electronic equipment
CN115237316A (en) * 2022-06-06 2022-10-25 华为技术有限公司 Audio track marking method and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933048A (en) * 2014-03-17 2015-09-23 Lenovo (Beijing) Co., Ltd. Voice message processing method and device, and electronic device
CN108272436A (en) * 2012-12-26 2018-07-13 Fitbit, Inc. Device-state-dependent user interface management
CN109716269A (en) * 2016-09-19 2019-05-03 Apple Inc. Gesture detection, list navigation, and item selection using a crown and sensors
CN110381382A (en) * 2019-07-23 2019-10-25 Tencent Technology (Shenzhen) Co., Ltd. Video note generation method and device, storage medium, and computer equipment
CN111144074A (en) * 2018-11-05 2020-05-12 Tencent Technology (Shenzhen) Co., Ltd. Document collaboration method and device, computer-readable storage medium, and computer equipment

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7617445B1 (en) * 2001-03-16 2009-11-10 Ftr Pty. Ltd. Log note system for digitally recorded audio
CN102262890A (en) * 2010-05-31 2011-11-30 Hon Hai Precision Industry (Shenzhen) Co., Ltd. Electronic device and marking method thereof
KR102196671B1 (en) * 2013-01-11 2020-12-30 LG Electronics Inc. Electronic device and method of controlling the same
CN104505108B (en) * 2014-12-04 2018-01-19 Guangdong OPPO Mobile Telecommunications Corp., Ltd. Information locating method and terminal
CN106131324A (en) * 2016-06-28 2016-11-16 Guangdong OPPO Mobile Telecommunications Corp., Ltd. Audio data processing method, device and terminal
CN108885614B (en) * 2017-02-06 2020-12-15 Huawei Technologies Co., Ltd. Text and voice information processing method and terminal
CN207149252U (en) * 2017-08-01 2018-03-27 Anhui Tingjian Technology Co., Ltd. Speech processing system
CN109286728B (en) * 2018-11-29 2021-01-08 Vivo Mobile Communication Co., Ltd. Call content processing method and terminal equipment
CN110287364B (en) * 2019-06-28 2021-10-08 Hefei iFlytek Readwrite Technology Co., Ltd. Voice search method, system, device and computer-readable storage medium
CN111026358B (en) * 2019-12-24 2023-05-02 Beijing Mininglamp Software System Co., Ltd. Voice message playing method, playing device and readable storage medium


Also Published As

Publication number Publication date
CN113936697A (en) 2022-01-14

Similar Documents

Publication Publication Date Title
CN106024009B (en) Audio processing method and device
CN108345581A (en) Information identification method and device, and terminal device
CN106202150A (en) Method for information display and device
EP3790001B1 (en) Speech information processing method, device and storage medium
CN111368541A (en) Named entity identification method and device
CN113936697B (en) Voice processing method and device for voice processing
CN111160047A (en) Data processing method and device and data processing device
CN111831806A (en) Semantic integrity determination method and device, electronic equipment and storage medium
CN109471919B (en) Zero pronoun resolution method and device
CN113657101A (en) Data processing method and device and data processing device
CN112133295B (en) Speech recognition method, device and storage medium
CN111629270A (en) Candidate item determination method and device and machine-readable medium
CN113553946A (en) Information prompting method and device, electronic equipment and storage medium
CN113343675A (en) Subtitle generating method and device for generating subtitles
CN108073291B (en) Input method and device and input device
CN112381091A (en) Video content identification method and device, electronic equipment and storage medium
CN111831132A (en) Information recommendation method and device and electronic equipment
CN110968246A (en) Intelligent Chinese handwriting input recognition method and device
CN112837668B (en) Voice processing method and device for processing voice
CN113873165A (en) Photographing method and device and electronic equipment
CN113127613B (en) Chat information processing method and device
CN110716653B (en) Method and device for determining association source
CN112784700B (en) Method, device and storage medium for displaying face image
CN111414731B (en) Text labeling method and device
CN109933213A (en) Input method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant