CN114556353A - Data processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number: CN114556353A
Application number: CN201980100994.5A
Authority: CN (China)
Prior art keywords: voice data, command word, speech, content, word
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventor: 杨林举
Assignees: Guangdong Oppo Mobile Telecommunications Corp Ltd; Shenzhen Huantai Technology Co Ltd
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd and Shenzhen Huantai Technology Co Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data

Abstract

A data processing method, an apparatus, an electronic device, and a storage medium, wherein the method comprises: acquiring voice data and performing content detection on the voice data (301); when a command word contained in the voice data is detected, performing a corresponding operation on a presentation document according to an instruction corresponding to the command word (302), where the content of the presentation document is associated with the content of the voice data and the presentation document is presented while the voice data is played; and when speech content contained in the voice data is detected, presenting the speech content while the voice data is played (303).

Description

Data processing method and device, electronic equipment and storage medium

Technical Field
The present application relates to simultaneous interpretation technology, and in particular, to a data processing method, apparatus, electronic device, and storage medium.
Background
With the rapid development of artificial intelligence (AI), the technology has moved out of laboratory research into the real world and is now applied in many aspects of everyday life.
A simultaneous interpretation system is a speech translation product for conference scenarios that has emerged in recent years; applying AI technology, it provides multilingual text translation and text presentation for the speech content of a conference speaker.
In related simultaneous interpretation systems, any operation on the presentation document used by a conference speaker has to be performed by the speaker or an assistant, which is inconvenient for the speaker.
Disclosure of Invention
In order to solve the related technical problems, embodiments of the present application provide a data processing method, an apparatus, an electronic device, and a storage medium.
The embodiment of the application provides a data processing method, which comprises the following steps:
acquiring voice data and carrying out content detection on the voice data;
when a command word contained in the voice data is detected, performing a corresponding operation on a presentation document according to an instruction corresponding to the command word; the content of the presentation document is associated with the content of the voice data; the presentation document is presented while the voice data is played;
and when speech content contained in the voice data is detected, presenting the speech content while the voice data is played.
In the above scheme, detecting the command word contained in the voice data includes:
querying a command word library according to the voice data, and determining a command word in the voice data that meets a first preset condition.
In the above scheme, determining the command word in the voice data that meets the first preset condition includes at least one of:
determining a command word in the voice data whose pronunciation similarity to a word in the command word library exceeds a preset threshold;
determining a command word in a recognized text that matches a word in the command word library, the recognized text being obtained by performing speech recognition on the voice data.
In the above scheme, performing the corresponding operation on the presentation document according to the instruction corresponding to the command word includes:
querying an instruction library according to the command word, and determining a target instruction corresponding to the command word, where the target instruction characterizes an operation instruction for the presentation document, and the instruction library includes at least one instruction and the command word corresponding to each of the at least one instruction;
and executing the corresponding operation on the presentation document according to the target instruction.
In the above scheme, detecting the speech content contained in the voice data includes:
querying the command word library according to the voice data, and determining speech content in the voice data that does not meet a second preset condition.
In the above scheme, determining the speech content in the voice data that does not meet the second preset condition includes at least one of:
determining speech content in the voice data, where the pronunciation similarity between any word of the speech content and each word in the command word library is lower than a preset threshold;
determining a speech text in the recognized text, where the matching degree between any word of the speech text and each word in the command word library is lower than a preset matching degree threshold, the recognized text being obtained by performing speech recognition on the voice data.
In the above scheme, presenting the speech content while the voice data is played includes:
determining a recognition result corresponding to the speech content;
presenting the recognition result while the voice data is played;
wherein the recognition result includes at least one of: speech text in at least one language, and translated speech data in at least one language.
An embodiment of the present application further provides a simultaneous interpretation apparatus, including:
an acquisition unit configured to acquire voice data and perform content detection on the voice data;
a first processing unit configured to, when a command word contained in the voice data is detected, perform a corresponding operation on a presentation document according to an instruction corresponding to the command word, where the content of the presentation document is associated with the content of the voice data, and the presentation document is presented while the voice data is played;
and a second processing unit configured to, when speech content contained in the voice data is detected, present the speech content while the voice data is played.
An embodiment of the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any of the data processing methods when executing the program.
The embodiment of the present application further provides a storage medium, on which computer instructions are stored, and the instructions, when executed by a processor, implement the steps of any of the data processing methods described above.
With the data processing method and apparatus, the electronic device, and the storage medium provided by the embodiments of the present application, voice data is acquired and content detection is performed on it; when a command word contained in the voice data is detected, a corresponding operation is performed on a presentation document according to the instruction corresponding to the command word, where the content of the presentation document is associated with the content of the voice data and the presentation document is presented while the voice data is played; and when speech content contained in the voice data is detected, the speech content is presented while the voice data is played. In this way, the lecturer can trigger the corresponding operation on the presentation document simply by speaking, without operating it manually or relying on an assistant, which improves lecture efficiency, saves lecture time, and improves user experience.
Drawings
FIG. 1 is a flow chart illustrating a simultaneous interpretation method in the related art;
FIG. 2 is a flow chart illustrating a command execution method in a simultaneous interpretation process according to the related art;
FIG. 3 is a schematic flow chart illustrating a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a keyword detection method according to an embodiment of the present application;
FIG. 5 is another schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating a command execution method in the simultaneous interpretation process according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and specific embodiments.
FIG. 1 is a flowchart of a simultaneous interpretation method in the related art. As shown in FIG. 1, after the simultaneous interpretation server is started, the lecturer gives a lecture using it. During the lecture, the server acquires the speaker's voice data and performs speech recognition on it to obtain a recognized text in the same language as the voice data, and then machine-translates the recognized text to obtain a translated text. The server determines whether the translated text needs to be synthesized into speech: if so, speech is synthesized from the translated text, and the synthesized speech, the recognized text, and the translated text are output as the target result; if not, the recognized text and the translated text are output as the target result.
Specifically, the simultaneous interpretation server may collect the voice data through an operation terminal, which may be a personal computer (PC). The server may send the target result to the operation terminal, which casts the recognized text and the translated text onto a display screen; the server may also send the target result to a terminal, which displays the recognized text and the translated text on its human-computer interaction interface and plays the synthesized speech through its speech playback module. The target result is thus presented to the user, with the speaker's speech content translated into the language the user needs. Here, the terminal, which is held by the user, may be a mobile phone, a tablet computer, or the like.
During the lecture, the speaker can also display a presentation document through the operation terminal; specifically, the presentation document is cast onto the display screen and shown to the user. Here, the presentation document may be a PowerPoint (PPT) presentation, a Word document, or the like.
When the speaker needs to play, page through, rewind, or stop playing the presentation document, the speaker or an assistant has to click the corresponding key with a mouse, a page-turning pen, or the like. To perform operations such as playing, page turning, or rewinding, the speaker must therefore pause the lecture, operate the presentation document, and then resume, or issue a command to the assistant, who performs the operation before the lecture continues.
Specifically, referring to the flowchart shown in FIG. 2, the method for executing a command during simultaneous interpretation in the related art is as follows: while the speaker lectures using the simultaneous interpretation system, when a certain operation on the presentation document is needed and there is no assistant, the speaker performs the corresponding operation (such as page turning or rewinding) on the corresponding device (such as the operation terminal), so that the device determines and executes the corresponding command; when there is an assistant, the speaker issues a command to the assistant, who operates the corresponding device so that it determines and executes the corresponding command. The speaker waits for the device to execute the command and then continues the lecture using the simultaneous interpretation system. Here, the simultaneous interpretation system may include the simultaneous interpretation server, the operation terminal, the display screen, and the terminal.
In this related-art scheme, when the speaker operates the corresponding device (for example, to operate the presentation document), the speaker has to pause the lecture and then perform the operation, which prolongs the lecture and degrades the audience's experience; when there is an assistant, the speaker has to issue commands to the assistant, which adds labor cost to the lecture.
Based on this, in the embodiments of the present application, voice data is acquired and content detection is performed on it; when a command word contained in the voice data is detected, a corresponding operation is performed on a presentation document according to the instruction corresponding to the command word, where the content of the presentation document is associated with the content of the voice data and the presentation document is presented while the voice data is played; and when speech content contained in the voice data is detected, the speech content is presented while the voice data is played. In this way, the speaker can trigger the corresponding operation on the presentation document simply by speaking, without operating it manually or relying on an assistant, which improves lecture efficiency, saves lecture time, and improves user experience.
FIG. 3 is a schematic flowchart of the data processing method according to the embodiment of the present application; as shown in FIG. 3, the method includes:
step 301: acquiring voice data and carrying out content detection on the voice data;
step 302: when a command word contained in the voice data is detected, carrying out corresponding operation on a presentation document according to an instruction corresponding to the command word;
here, the content of the presentation document is associated with the content of the voice data, and the presentation document is presented while the voice data is played;
step 303: when speech content contained in the voice data is detected, presenting the speech content while the voice data is played.
Here, saying that the content of the presentation document is associated with the content of the voice data means that the document is related to the speech content, such as a PowerPoint (PPT) presentation or a Word document that the speaker shows through the display screen while speaking.
Saying that the presentation document is presented while the voice data is played means that the document is on display while the voice data plays. The data processing method can thus be applied to any conference scenario that requires a presentation document, such as a seminar.
The data processing method can be applied to electronic equipment; the electronic device may be a server, a terminal, or the like.
In practical application, the electronic device may be the server: the server casts the presentation document onto a display screen for display, and during the lecture it receives the voice data and executes the data processing method of the embodiments of the present application, thereby operating the presentation document accordingly.
The electronic device may also be a terminal while the presentation document is cast onto a display screen by a server: the terminal receives the voice data, executes the data processing method of this embodiment to determine an instruction for the presentation document, and sends the instruction to the server, which performs the corresponding operation on the document.
When the electronic device is a terminal, the presentation document may also be displayed on the terminal's own human-computer interaction interface; during the lecture, the terminal receives the voice data and executes the data processing method of the embodiments of the present application to operate the document displayed on that interface.
Here, when a speaker lectures, the voice data may be collected through an operation terminal (e.g., a PC or a mobile terminal) that is provided with or connected to a voice collection module such as a microphone; the operation terminal collects sound through the module to obtain the voice data and sends it to the electronic device (the server or the terminal), which executes the data processing method of the embodiments of the present application.
In practical application, to determine whether the user needs to operate the presentation document, the voice data needs to be detected; when the voice data is detected to contain a command word, it can be determined that the user needs to operate the presentation document.
Based on this, in an embodiment, detecting the command word contained in the voice data includes:
querying a command word library according to the voice data, and determining a command word in the voice data that meets a first preset condition.
Specifically, determining the command word in the voice data that meets the first preset condition includes at least one of:
determining a command word in the voice data whose pronunciation similarity to a word in the command word library exceeds a preset threshold;
determining a command word in the recognized text that matches a word in the command word library, the recognized text being obtained by performing speech recognition on the voice data.
In practical application, the voice data may be detected directly to determine the command word (in speech format), or speech recognition may first be performed on the voice data to obtain a recognized text in the same language, from which the command word (in text format) is determined.
Accordingly, the command word library includes at least one command word, and each command word has both a speech-format representation (the command word's pronunciation) and a text-format representation (the command word's text).
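As an illustration, the two detection routes can be sketched as follows. This is a minimal sketch, not the patented implementation: the library contents, the pinyin-style pronunciation strings, and the use of plain string similarity in place of a real acoustic similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Assumed command word library: each entry pairs the text-format
# representation of a command word with a speech-format representation
# (approximated here by pinyin strings; a real system would compare
# acoustic or phoneme-level features instead).
COMMAND_LEXICON = {
    "page turning": "fan ye",
    "roll back": "hui tui",
}

PRONUNCIATION_THRESHOLD = 0.8  # the "preset threshold"; value is assumed


def pronunciation_similarity(a: str, b: str) -> float:
    # Crude string similarity standing in for pronunciation similarity.
    return SequenceMatcher(None, a, b).ratio()


def detect_by_pronunciation(utterance_pronunciation: str) -> list[str]:
    # Route 1: command words whose pronunciation similarity to the
    # utterance exceeds the preset threshold.
    return [word for word, pron in COMMAND_LEXICON.items()
            if pronunciation_similarity(utterance_pronunciation, pron)
            > PRONUNCIATION_THRESHOLD]


def detect_by_text(recognized_text: str) -> list[str]:
    # Route 2: command words that match the recognized text obtained
    # by speech recognition.
    return [word for word in COMMAND_LEXICON if word in recognized_text]
```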
In practical application, when the command word determined based on pronunciation (a word in the voice data whose pronunciation similarity to a word in the command word library exceeds the preset threshold) is the same as the command word determined based on text (a word in the recognized text that matches a word in the command word library), that command word can be used directly;
the two may, however, differ, in which case a target command word needs to be selected from the pronunciation-based candidate and the text-based candidate.
Specifically, the command word determined based on pronunciation is recorded as a first command word, and the command word determined based on text as a second command word.
When the first command word differs from the second command word, a first weight and a second weight are determined, where the first weight represents the credibility of pronunciation-based determination and the second weight the credibility of text-based determination; the first command word is weighted by the first weight, the second command word by the second weight, and one of the two is selected as the target command word according to the weighted results.
For example, when both the first command word and the second command word are determined to be command word A, command word A is directly taken as the target command word.
When the first command word is determined to be command word A and the second command word to be command word B, the first and second weights are determined; command word A is weighted by the first weight and command word B by the second weight; the two weighted results are compared, and if the weighted result of command word A is greater than that of command word B, command word A is taken as the target command word; otherwise, command word B is.
Here, it is considered that several first command words may be obtained based on pronunciation, and several second command words based on text; that is, there may be at least two first command words and at least two second command words.
In this case, each of the at least two first command words is weighted by the first weight to obtain a weighted result for each first command word,
and each of the at least two second command words is weighted by the second weight to obtain a weighted result for each second command word.
If the at least two first command words and the at least two second command words share no command word (each first command word differs from each second command word), the command word with the largest weighted result among all of them is selected as the target command word.
If they do share a command word (some first command word is the same as some second command word), the weighted results for that shared command word are added to obtain a single weighted result per command word, and the command word with the largest weighted result among all of them is selected as the target command word.
For example, suppose command word A and command word B are obtained based on pronunciation, where the probability that A is the target command word is a1% and the probability for B is b% (a1% + b% may equal 1);
and command word A and command word C are obtained based on text, where the probability for A is a2% and the probability for C is c% (a2% + c% may equal 1).
Assume the first weight is x, the second weight is y, and x + y = 1.
The weighted result of each command word is then:
command word A: a1% · x + a2% · y;
command word B: b% · x;
command word C: c% · y.
The command word with the largest weighted result among command words A, B, and C is selected as the target command word.
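The selection above amounts to a weighted vote, sketched below; the candidate probabilities and the weights x and y are illustrative values that a real system would obtain from its recognizers and from tuning.

```python
def select_target_command_word(pron_candidates: dict[str, float],
                               text_candidates: dict[str, float],
                               x: float, y: float) -> str:
    # Weight each pronunciation-based candidate by x and each text-based
    # candidate by y; a candidate found by both routes has its two
    # weighted results added together.
    scores: dict[str, float] = {}
    for word, prob in pron_candidates.items():
        scores[word] = scores.get(word, 0.0) + prob * x
    for word, prob in text_candidates.items():
        scores[word] = scores.get(word, 0.0) + prob * y
    # The command word with the largest weighted result is the target.
    return max(scores, key=scores.get)


# The worked example above: A and B from pronunciation, A and C from text.
target = select_target_command_word(
    {"A": 0.7, "B": 0.3},  # a1% = 70%, b% = 30%
    {"A": 0.8, "C": 0.2},  # a2% = 80%, c% = 20%
    x=0.6, y=0.4)
print(target)  # "A": 0.7*0.6 + 0.8*0.4 = 0.74, the largest score
```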
After a command word is detected, the electronic device can determine that the user needs to perform a corresponding operation on the presentation document, and can determine the corresponding operation instruction from the detected command word to control the presentation document.
Specifically, performing the corresponding operation on the presentation document according to the instruction corresponding to the command word includes:
querying an instruction library according to the command word, and determining a target instruction corresponding to the command word, where the target instruction characterizes an operation instruction for the presentation document, and the instruction library includes at least one instruction and the command word corresponding to each of the at least one instruction;
and executing the corresponding operation on the presentation document according to the target instruction.
And executing corresponding operation on the presentation document through the determined target instruction, namely realizing the control on the presentation document.
For example, when the command word "back" is detected, a first control instruction for the presentation document is determined from it; that is, the first control instruction controls the presentation document to go back (turn to the previous page).
When the command word "page turning" is detected, a second control instruction for the presentation document is determined from it; that is, the second control instruction controls the presentation document to turn the page (advance to the next page).
The first control instruction may be executed by the program that displays the presentation document; for example, if the document is a PPT document, the corresponding program may be the Microsoft Office PowerPoint application, which determines the first control instruction and performs the corresponding back operation.
The second control instruction is handled similarly: it may be executed by the program that displays the presentation document, e.g., the Microsoft Office PowerPoint application, which determines the second control instruction and performs the corresponding page-turning operation.
Here, the instruction library may be preset and saved in the electronic device by a developer. The instructions stored in the instruction library and the command words corresponding to the corresponding instructions can be provided for the speaker to view in advance.
Specifically, when the data processing method is applied to a server (that is, the electronic device is a server), the instruction library may be preset by a developer and stored in the server.
When the data processing method is applied to a terminal (that is, the electronic device is the terminal), the instruction library may be preset by a developer, stored in a server, and then sent to the terminal; correspondingly, after the instruction library is updated, the server may send the updated library to the terminal again, and the terminal receives and stores it.
Command words may be provided in at least one language, so that speakers using different languages can obtain the target instructions corresponding to their command words by querying the instruction library.
A command word in any language may have at least one expression; that is, there may be several command words with similar semantics.
For example, the instructions include: turning a page of the presentation document, rolling back the presentation document, and the like; a sketch of such a library follows this list.
The instruction "turn a page of the presentation document" may correspond to the following semantically similar command words: "page turning", "turn to the next page", "skip to the next page", and the like;
the instruction "roll back the presentation document" may correspond to: "return to the previous page", "roll back", and the like.
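A minimal sketch of such an instruction library, using the two instructions and the semantically similar command words listed above; the instruction names and the print placeholders standing in for the real presentation-program calls are assumptions.

```python
# Assumed instruction library: each instruction with the command words
# that map to it.
INSTRUCTION_LIBRARY = {
    "turn_page": ["page turning", "turn to the next page",
                  "skip to the next page"],
    "roll_back": ["return to the previous page", "roll back"],
}

# Inverted index: detected command word -> target instruction.
WORD_TO_INSTRUCTION = {word: instruction
                       for instruction, words in INSTRUCTION_LIBRARY.items()
                       for word in words}


def execute_for_command_word(command_word: str) -> None:
    instruction = WORD_TO_INSTRUCTION.get(command_word)
    if instruction == "turn_page":
        print("presentation: advance to the next page")  # placeholder action
    elif instruction == "roll_back":
        print("presentation: return to the previous page")  # placeholder
    # further instructions (play, stop playing, ...) would be added here
```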
The data processing method of the embodiments of the present application can also be applied to a simultaneous interpretation scenario. In that scenario, the electronic device may specifically be a device for implementing simultaneous interpretation, such as the simultaneous interpretation server used in the method shown in FIG. 1, and the voice data may be collected by an operation terminal (such as the PC mentioned above) and sent to the server.
Specifically, the simultaneous interpretation scenario may adopt the architecture of the simultaneous interpretation system described above (which may include the simultaneous interpretation server, the operation terminal, and the terminal). The data processing method may be applied to a device for implementing simultaneous interpretation, which may be a device newly added to that architecture or an existing device in it (e.g., the server or the terminal) improved to implement the method of the embodiments of the present application.
Here, the improvement may consist of adding a command word detection module to the device; the module detects the voice data, determines the instruction corresponding to a detected command word, and executes the corresponding operation according to the instruction.
Specifically, in a conference simultaneous interpretation scenario, while the speaker is speaking, the operation terminal (e.g., a PC) provided with or connected to a voice collection module such as a microphone collects the voice data through that module and sends it to the device for implementing simultaneous interpretation; the operation terminal may also cast the presentation document onto a display screen for the audience. The device receives the voice data, executes the data processing method of the embodiments of the present application, operates the presentation document accordingly, and performs simultaneous interpretation of the speech content in the voice data.
When the device for implementing simultaneous interpretation is a server, the server receives the voice data sent by the operation terminal and proceeds as above.
The device may also be a terminal held by a user: the operation terminal, or a server that receives the voice data, sends the voice data to that terminal, which receives it, executes the data processing method of the embodiments of the present application, operates the presentation document accordingly, and performs simultaneous interpretation of the speech content in the voice data.
In practical application, implementing simultaneous interpretation requires determining the speech content in the voice data, performing recognition on it, and displaying the recognition result to the user.
Based on this, in an embodiment, detecting the speech content contained in the voice data includes:
querying the command word library according to the voice data, and determining speech content in the voice data that does not meet a second preset condition.
Specifically, determining the speech content in the voice data that does not meet the second preset condition includes at least one of:
determining speech content in the voice data, where the pronunciation similarity between any word of the speech content and each word in the command word library is lower than the preset threshold;
determining a speech text in the recognized text, where the matching degree between any word of the speech text and each word in the command word library is lower than a preset matching degree threshold, the recognized text being obtained by performing speech recognition on the voice data.
Here, the preset threshold and the preset matching degree threshold may be preset by a developer and stored in the corresponding device.
When detecting speech content, the voice data may be detected directly to determine the speech content (in speech format), or speech recognition may first be performed on the voice data to obtain the recognized text, from which the speech text (the speech content in text format) is determined.
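Under the same assumptions, and reusing the two detection functions from the earlier sketch, speech content is simply whatever fails both command word tests:

```python
def is_speech_content(utterance_pronunciation: str,
                      recognized_text: str) -> bool:
    # Content is treated as speech content when no word reaches the
    # pronunciation similarity threshold and nothing in the recognized
    # text matches the command word library.
    return (not detect_by_pronunciation(utterance_pronunciation)
            and not detect_by_text(recognized_text))
```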
After the speech content is determined, it needs to be presented in the form required by the user. Specifically, presenting the speech content while the voice data is played includes:
determining a recognition result corresponding to the speech content;
presenting the recognition result while the voice data is played;
wherein the recognition result includes at least one of: speech text in at least one language, and translated speech data in at least one language.
Here, determining the recognition result corresponding to the speech content includes at least one of the following (see the sketch after this list):
performing speech recognition on the speech content to obtain a first recognized text, whose language is the same as that of the speech content;
translating the first recognized text to obtain a second recognized text, whose language differs from that of the speech content;
performing speech synthesis on the second recognized text to obtain the corresponding speech, i.e., the translated speech data corresponding to the speech content.
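A sketch of assembling the recognition result along these steps; the speech_recognize, machine_translate, and speech_synthesize functions are placeholders, since the patent does not name particular ASR, machine translation, or speech synthesis engines.

```python
from dataclasses import dataclass


@dataclass
class RecognitionResult:
    first_text: str          # recognized text, same language as the speech
    second_text: str         # translated text, a different language
    translated_audio: bytes  # translated speech data (empty if no TTS)


def speech_recognize(speech: bytes) -> str:
    return "hello everyone"           # placeholder for a real ASR engine


def machine_translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"  # placeholder for a real MT engine


def speech_synthesize(text: str) -> bytes:
    return text.encode()              # placeholder for a real TTS engine


def build_recognition_result(speech: bytes, target_lang: str,
                             need_tts: bool) -> RecognitionResult:
    first = speech_recognize(speech)                        # step 1
    second = machine_translate(first, target_lang)          # step 2
    audio = speech_synthesize(second) if need_tts else b""  # step 3
    return RecognitionResult(first, second, audio)
```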
In practical application, the content to be included in the recognition result may be selected in advance by the user holding the terminal, and the selection sent to the server, so that the server provides the corresponding recognition result according to the user's choice.
Based on this, in an embodiment, when the data processing method is applied to a server, the method further includes:
receiving a first acquisition request sent by a terminal, the first acquisition request being used to obtain the corresponding recognition result;
determining a target recognition result according to the first acquisition request, and sending the obtained target recognition result to the terminal.
For example, the first acquisition request may request the recognized text, the translated speech data, or both.
In practical application, in order to provide the recognition result corresponding to the language meeting the user requirement, the recognition result of the corresponding language may be obtained according to the obtaining request sent by the user through the terminal.
Based on this, in an embodiment, the method may further include:
receiving a second acquisition request sent by the terminal, the second acquisition request including at least a target language;
acquiring, from the recognition results in at least one language, the recognition result corresponding to the target language;
and sending the obtained recognition result corresponding to the target language to the terminal.
Here, the terminal may provide a human-computer interaction interface through which the user holding it selects a language; the terminal generates a second acquisition request containing the target language according to the selection and sends it to the server, which receives the request.
The terminal may be a mobile phone. Since most users already carry a mobile phone, sending the recognition result to the phone avoids adding extra equipment to receive and display it, which saves cost and is convenient to operate.
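How the server might serve the second acquisition request can be sketched as follows, reusing the RecognitionResult type from the previous sketch; the request shape and the per-language result store are illustrative assumptions.

```python
# Recognition results kept per target language (contents illustrative).
RESULTS_BY_LANGUAGE = {
    "en": RecognitionResult("hello everyone", "hello everyone", b""),
    "fr": RecognitionResult("hello everyone", "bonjour a tous", b""),
}


def handle_second_acquisition_request(request: dict) -> RecognitionResult:
    # The second acquisition request carries at least the target language.
    target_language = request["target_language"]
    return RESULTS_BY_LANGUAGE[target_language]
```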
In an embodiment, the data processing method may also be applied to a terminal held by the user. Through the terminal's human-computer interaction interface, the user selects the language and the content to be included in the recognition result; the terminal determines these selections and uses the data processing method provided in the embodiments of the present application to obtain and present a recognition result meeting the user's needs.
In the embodiments of the present application, when the data processing method is applied to the simultaneous interpretation scenario, the voice data changes continuously as the lecture proceeds, and the recognition result changes with it.
The data processing method provided by the embodiments of the present application is applicable to various voice application scenarios, such as the simultaneous interpretation scenario and a video conference scenario. Applied to a video conference, the method can likewise determine command words spoken by the user through the command word detection module and, based on them, operate on the document displayed in the video conference.
It should be understood that, the order of the steps described in the foregoing embodiments does not mean the execution order, and the execution order of the steps should be determined by the functions and the inherent logic of the steps, and should not constitute any limitation to the implementation process of the embodiments of the present application.
With the data processing method provided by the embodiments of the present application, voice data is acquired and content detection is performed on it; when a command word contained in the voice data is detected, a corresponding operation is performed on a presentation document according to the instruction corresponding to the command word, where the content of the presentation document is associated with the content of the voice data and the presentation document is presented while the voice data is played; and when speech content contained in the voice data is detected, the speech content is presented while the voice data is played. In this way, the lecturer can trigger the corresponding operation simply by speaking, without manual operation or an assistant, which improves lecture efficiency, saves lecture time, and improves user experience.
FIG. 4 is a schematic flowchart of a keyword detection method according to an embodiment of the present application; as shown in FIG. 4, the keyword detection method may be applied to an electronic device (such as the above server or a terminal held by a user), and the method includes:
step 401: receiving input voice data;
step 402: performing command word detection on input voice data;
step 403: judging whether a command word is detected; if the command word is detected, go to step 404; if it is determined that no command word is detected, go to step 405;
here, judging whether a command word is detected includes one of:
judging whether the voice data contains a command word whose pronunciation similarity to a word in the command word library exceeds a preset threshold;
judging whether the recognized text contains a command word matching a word in the command word library, the recognized text being obtained by performing speech recognition on the voice data.
Step 404: determining an instruction corresponding to the command word, and executing corresponding operation according to the instruction;
here, the instruction may be an operation instruction for a presentation document, and the corresponding operation is performed on the presentation document according to the determined operation instruction; for details, see the method shown in FIG. 3.
It should be noted that in other scenarios the command words may be expanded based on user requirements. Command words may also cover operations on other programs, such as adjusting the volume of the electronic device (e.g., turning the volume up): after the corresponding volume command word is determined, the corresponding instruction is determined and executed to adjust the volume.
Step 405: continue to detect command words.
FIG. 5 is a schematic flowchart of a data processing method according to an embodiment of the present application; as shown in FIG. 5, the data processing method is applied to an electronic device (such as the server or the terminal held by the user), and the method includes:
step 501: performing voice recognition on voice data;
here, the voice data is the speech uttered by the speaker during a lecture in a simultaneous interpretation scenario.
Step 502: performing command word detection on the voice data; judging whether a command word is detected or not, and entering step 503 if the command word is detected; if the command word is not detected, go to step 504;
here, performing command word detection on the voice data includes:
detecting a command word in the voice data whose pronunciation similarity to a word in the command word library exceeds a preset threshold;
detecting a command word in the recognized text that matches a word in the command word library, the recognized text being obtained by performing speech recognition on the voice data.
Here, the recognized text corresponding to the voice data may be examined to judge whether a command word is present, or the voice data may be detected directly during speech recognition to determine the command word. For details, refer to the determination of the command word meeting the first preset condition in the method shown in FIG. 3, which is not repeated here.
Step 503: determining an instruction corresponding to the command word, and executing corresponding operation according to the instruction;
step 504: performing machine translation on the recognition text corresponding to the voice data to obtain a translation text;
here, the recognition text is a text obtained by performing speech recognition on speech data.
Step 505: judging whether speech synthesis is needed; if so, go to step 506; if not, go to step 507;
here, whether or not speech synthesis is necessary may be set in advance by a developer and stored in a corresponding device.
Step 506: performing speech synthesis on the translated text;
Step 507: outputting the simultaneous interpretation result.
Here, the simultaneous interpretation result may include the recognized text and the translated text; when speech synthesis is determined to be needed, it may further include the translated speech data (i.e., the speech obtained by synthesizing the translated text).
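Tying the pieces together, the flow of steps 501 to 507 can be sketched as a single routine that reuses the detection, execution, and result-building functions from the earlier sketches; the branch structure is the only part taken from the figure, the rest rests on the same assumptions as before.

```python
def process_voice_data(speech: bytes, utterance_pronunciation: str,
                       target_lang: str, need_tts: bool):
    # Steps 501-502: recognize the speech and run command word detection
    # on both the pronunciation and the recognized text.
    recognized = speech_recognize(speech)
    candidates = (detect_by_pronunciation(utterance_pronunciation)
                  or detect_by_text(recognized))
    if candidates:
        # Step 503: a command word was detected; execute its instruction.
        execute_for_command_word(candidates[0])
        return None
    # Steps 504-507: no command word; translate the speech content,
    # optionally synthesize speech, and output the interpretation result.
    return build_recognition_result(speech, target_lang, need_tts)
```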
FIG. 6 is a flowchart of a command execution method during simultaneous interpretation according to an embodiment of the present application; as shown in FIG. 6, the method includes:
step 601: the lecturer gives a lecture using the simultaneous interpretation system;
step 602: when the lecturer needs a corresponding operation to be executed, the lecturer speaks the corresponding command word;
step 603: the simultaneous interpretation system determines the command word using the data processing method described above and executes the instruction corresponding to it;
step 604: the lecturer continues the lecture using the simultaneous interpretation system.
In order to implement the data processing method of the embodiment of the application, the embodiment of the application also provides a data processing device. FIG. 7 is a schematic diagram of a data processing apparatus according to an embodiment of the present application; as shown in fig. 7, the data processing apparatus includes:
an acquisition unit 71 configured to acquire voice data and perform content detection on the voice data;
the first processing unit 72 is configured to, when a command word included in the voice data is detected, perform a corresponding operation on the presentation document according to an instruction corresponding to the command word;
here, the content of the presentation document is associated with the content of the voice data, and the presentation document is presented while the voice data is played;
a second processing unit 73 configured to, when speech content contained in the voice data is detected, present the speech content while the voice data is played.
In an embodiment, the first processing unit 72 is configured to query a command word library according to the voice data and determine a command word in the voice data that meets a first preset condition.
In an embodiment, determining the command word in the voice data that meets the first preset condition includes at least one of:
determining a command word in the voice data whose pronunciation similarity to a word in the command word library exceeds a preset threshold;
determining a command word in the recognized text that matches a word in the command word library, the recognized text being obtained by performing speech recognition on the voice data.
In an embodiment, the first processing unit 72 is configured to query an instruction library according to the command word and determine a target instruction corresponding to the command word, where the target instruction characterizes an operation instruction for the presentation document, and the instruction library includes at least one instruction and the command word corresponding to each of the at least one instruction;
and to execute the corresponding operation on the presentation document according to the target instruction.
In an embodiment, the second processing unit 73 is configured to query the command word library according to the voice data and determine speech content in the voice data that does not meet a second preset condition.
In an embodiment, determining the speech content in the voice data that does not meet the second preset condition includes at least one of:
determining speech content in the voice data, where the pronunciation similarity between any word of the speech content and each word in the command word library is lower than the preset threshold;
determining a speech text in the recognized text, where the matching degree between any word of the speech text and each word in the command word library is lower than a preset matching degree threshold, the recognized text being obtained by performing speech recognition on the voice data.
In an embodiment, the second processing unit 73 is configured to determine a recognition result corresponding to the speech content
and present the recognition result while the voice data is played,
wherein the recognition result includes at least one of: speech text in at least one language, and translated speech data in at least one language.
In practical applications, the acquisition unit 71, the first processing unit 72, and the second processing unit 73 may be implemented by a processor in the electronic device (such as the server or a terminal held by the user), for example a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Micro Control Unit (MCU), or a Field-Programmable Gate Array (FPGA).
It should be noted that: the apparatus provided in the foregoing embodiment is only exemplified by the division of each program module when performing data processing, and in practical applications, the above processing may be distributed to different program modules according to needs, that is, the internal structure of the terminal is divided into different program modules to complete all or part of the above-described processing. In addition, the apparatus provided in the above embodiments and the data processing method embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Based on the hardware implementation of the above devices, an electronic device is further provided in the embodiments of the present application, fig. 8 is a schematic diagram of a hardware composition structure of the electronic device in the embodiments of the present application, as shown in fig. 8, an electronic device 80 includes a memory 83, a processor 82, and a computer program stored in the memory 83 and capable of running on the processor 82; when the processor 82 of the electronic device executes the program, the method provided by one or more technical solutions of the electronic device side is implemented.
In particular, the processor 82 located in the electronic device 80, when executing the program, implements: acquiring voice data and carrying out content detection on the voice data;
when a command word contained in the voice data is detected, carrying out corresponding operation on a presentation document according to an instruction corresponding to the command word; the content of the presentation document is associated with the content of the voice data; the presentation document is used for presenting when the voice data is played;
and when the speech content contained in the voice data is detected, presenting the speech content when the voice data is played.
It should be noted that, the specific steps implemented when the processor 82 located in the electronic device 80 executes the program have been described in detail above, and are not described herein again.
It is to be understood that the electronic device also includes a communication interface 81; the various components in the electronic device are coupled together by a bus system 84. It will be appreciated that the bus system 84 is configured to enable connected communication between these components. The bus system 84 includes a power bus, a control bus, a status signal bus, and the like, in addition to the data bus.
It is to be understood that the memory 83 in the embodiments of the present application may be volatile memory or nonvolatile memory, and may include both. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk memory or tape memory. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memories described in the embodiments of the present application are intended to include, without being limited to, these and any other suitable types of memory.
The method disclosed in the above embodiments of the present application may be applied to the processor 82, or implemented by the processor 82. The processor 82 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by instructions in the form of hardware, integrated logic circuits, or software in the processor 82. The processor 82 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 82 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located on a storage medium in memory where the processor 82 reads the information from the memory and performs the steps of the methods described above in conjunction with its hardware.
The embodiment of the application also provides a storage medium, in particular a computer storage medium, and more particularly a computer-readable storage medium. The storage medium stores computer instructions which, when executed by a processor, implement the steps of the data processing method described in the above embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed method and intelligent device may be implemented in other ways. The above-described device embodiments are merely illustrative; for example, the division of the units is only a logical functional division, and there may be other divisions in actual implementation, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may serve as a single unit separately, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be completed by hardware related to program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic disk or an optical disk, or various other media capable of storing program code.
Alternatively, if the integrated units described above in the present application are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, in essence, or the portions contributing to the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic disk or an optical disk, or various other media that can store program code.
It should be noted that: "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The technical means described in the embodiments of the present application may be arbitrarily combined without conflict.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art can easily conceive within the technical scope disclosed by the present application shall be covered by the protection scope of the present application.

Claims (10)

  1. A method of data processing, comprising:
    acquiring voice data and performing content detection on the voice data;
    when a command word contained in the voice data is detected, performing a corresponding operation on a presentation document according to an instruction corresponding to the command word; the content of the presentation document is associated with the content of the voice data; the presentation document is presented while the voice data is played;
    and when speech content contained in the voice data is detected, presenting the speech content while the voice data is played.
  2. The method of claim 1, wherein detecting the command word contained in the voice data comprises:
    querying a command word library according to the voice data, and determining a command word in the voice data that meets a first preset condition.
  3. The method of claim 2, wherein determining the command word in the voice data that meets the first preset condition comprises at least one of:
    determining a command word in the voice data whose pronunciation similarity to a word in the command word library exceeds a preset threshold;
    determining a command word in a recognition text that matches a word in the command word library, wherein the recognition text is obtained by performing text recognition on the voice data.
  4. The method according to any one of claims 1 to 3, wherein performing the corresponding operation on the presentation document according to the instruction corresponding to the command word comprises:
    querying an instruction library according to the command word, and determining a target instruction corresponding to the command word; the target instruction characterizes an operation instruction for the presentation document; the instruction library comprises at least one instruction and the command word corresponding to each instruction of the at least one instruction;
    and performing the corresponding operation on the presentation document according to the target instruction.
  5. The method of claim 1, wherein detecting the speech content contained in the voice data comprises:
    querying a command word library according to the voice data, and determining speech content in the voice data that does not meet a second preset condition.
  6. The method of claim 5, wherein determining the speech content in the voice data that does not meet the second preset condition comprises at least one of:
    determining speech content in the voice data, wherein the similarity between the pronunciation of any word in the speech content and the pronunciation of each word in the command word library is lower than a preset threshold;
    determining a speech text in a recognition text, wherein the matching degree between any word in the speech text and each word in the command word library is lower than a preset matching degree threshold, the recognition text being obtained by performing text recognition on the voice data.
  7. The method of claim 1, 5 or 6, wherein presenting the speech content while the voice data is played comprises:
    determining a recognition result corresponding to the speech content;
    and presenting the recognition result while the voice data is played;
    wherein the recognition result comprises at least one of: a speech text in at least one language, and translated speech data in at least one language.
  8. A simultaneous interpretation apparatus, comprising:
    an acquisition unit, configured to acquire voice data and perform content detection on the voice data;
    a first processing unit, configured to, when a command word contained in the voice data is detected, perform a corresponding operation on a presentation document according to an instruction corresponding to the command word; the content of the presentation document is associated with the content of the voice data; the presentation document is presented while the voice data is played;
    and a second processing unit, configured to, when speech content contained in the voice data is detected, present the speech content while the voice data is played.
  9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of claims 1 to 7 when executing the program.
  10. A storage medium having stored thereon computer instructions which, when executed by a processor, carry out the steps of the method of any one of claims 1 to 7.
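As a non-normative reading aid for claims 2 to 6 (not part of the claimed subject matter), the sketch below shows one plausible realization of the two detection branches in Python: a text-match test on the recognition text and a pronunciation-similarity test against a command word library. The SequenceMatcher-based similarity measure, the threshold value, and all names are assumptions made for illustration; a real system would more likely compare phoneme sequences.

```python
# Illustrative sketch of the detection conditions in claims 2-6; the
# similarity measure, threshold, and all names are assumptions.
from difflib import SequenceMatcher
from typing import Iterable, Iterator, Optional, Tuple

COMMAND_LIBRARY = {"next page": "PAGE_DOWN", "previous page": "PAGE_UP"}
SIMILARITY_THRESHOLD = 0.85   # the "preset threshold" of claims 3 and 6 (value assumed)

def pronunciation_similarity(a: str, b: str) -> float:
    """Crude textual stand-in for comparing the pronunciations of two words."""
    return SequenceMatcher(None, a, b).ratio()

def find_command_word(word: str) -> Optional[str]:
    """First preset condition (claims 2-3): text match, or pronunciation similarity above the threshold."""
    for command in COMMAND_LIBRARY:
        if word == command:                 # text-match branch (claim 3, second item)
            return command
        if pronunciation_similarity(word, command) > SIMILARITY_THRESHOLD:
            return command                  # pronunciation branch (claim 3, first item)
    return None

def split_recognition_text(words: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Second preset condition (claims 5-6): words matching no command word are speech content."""
    for word in words:
        command = find_command_word(word)
        if command is not None:
            yield ("instruction", COMMAND_LIBRARY[command])   # claim 4: target instruction
        else:
            yield ("speech content", word)                    # presented while the audio plays

for kind, value in split_recognition_text(["hello everyone", "next paje"]):
    print(kind, value)   # -> speech content hello everyone / instruction PAGE_DOWN
```

Claim 7's presentation step would then consume the "speech content" items, attaching the recognition result: speech text in one or more languages, translated speech audio, or both.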
CN201980100994.5A 2019-12-16 2019-12-16 Data processing method and device, electronic equipment and storage medium Pending CN114556353A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/125606 WO2021119908A1 (en) 2019-12-16 2019-12-16 Data processing method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN114556353A (en)

Family

ID=76476927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980100994.5A Pending CN114556353A (en) 2019-12-16 2019-12-16 Data processing method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114556353A (en)
WO (1) WO2021119908A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102339193A (en) * 2010-07-21 2012-02-01 Tcl集团股份有限公司 Voice control conference speed method and system
CN103257841A (en) * 2013-05-16 2013-08-21 问航 Method for controlling PPT presentation software by voice input of microphone
JP6375521B2 (en) * 2014-03-28 2018-08-22 パナソニックIpマネジメント株式会社 Voice search device, voice search method, and display device
CN206594846U (en) * 2016-11-06 2017-10-27 刘守全 Voice command formula teaching apparatus of history lesson
CN107046523A (en) * 2016-11-22 2017-08-15 深圳大学 A kind of simultaneous interpretation method and client based on individual mobile terminal
CN109584880A (en) * 2018-11-26 2019-04-05 山东弘享机器人有限公司 A kind of implementation method of robot speech PPT file

Also Published As

Publication number Publication date
WO2021119908A1 (en) 2021-06-24

Similar Documents

Publication Publication Date Title
JP6952184B2 (en) View-based voice interaction methods, devices, servers, terminals and media
JP7029613B2 (en) Interfaces Smart interactive control methods, appliances, systems and programs
JP7181332B2 (en) Voice conversion method, device and electronic equipment
CN110517689B (en) Voice data processing method, device and storage medium
CN108920666B (en) Semantic understanding-based searching method, system, electronic device and storage medium
US9805718B2 (en) Clarifying natural language input using targeted questions
CN111050201B (en) Data processing method and device, electronic equipment and storage medium
JP6681450B2 (en) Information processing method and device
JP2015516587A (en) Devices that extract information from dialogue
CN114556328A (en) Data processing method and device, electronic equipment and storage medium
CN109543021B (en) Intelligent robot-oriented story data processing method and system
KR20200027331A (en) Voice synthesis device
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
JPH08339198A (en) Presentation device
JP2019185737A (en) Search method and electronic device using the same
US10714087B2 (en) Speech control for complex commands
WO2019169722A1 (en) Shortcut key recognition method and apparatus, device, and computer-readable storage medium
CN110992960A (en) Control method, control device, electronic equipment and storage medium
WO2021097629A1 (en) Data processing method and apparatus, and electronic device and storage medium
WO2023142590A1 (en) Sign language video generation method and apparatus, computer device, and storage medium
WO2021087665A1 (en) Data processing method and apparatus, server, and storage medium
CN111161710A (en) Simultaneous interpretation method and device, electronic equipment and storage medium
CN114556353A (en) Data processing method and device, electronic equipment and storage medium
CN111580766B (en) Information display method and device and information display system
CN111161737A (en) Data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination