WO2021119908A1 - Data processing method and apparatus, electronic device, and storage medium

Info

Publication number: WO2021119908A1
Application number: PCT/CN2019/125606
Authority: WO
Inventors: 杨林举
Original assignee: 深圳市欢太科技有限公司; Oppo广东移动通信有限公司
Priority date: 2019-12-16
Filing date: 2019-12-16
Publication date: 2021-06-24
Also published as: CN114556353A

Abstract

A data processing method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring voice data, and performing content detection on the voice data (301); upon detecting a command word included in the voice data, performing a corresponding operation with respect to a presentation document according to an instruction corresponding to the command word (302), wherein content of the presentation document is associated with content of the voice data, and the presentation document is used for a presentation during playback of the voice data; and upon detecting a speech content included in the voice data, presenting the speech content when the voice data is being played (303).

Description

Data processing method, device, electronic equipment and storage medium

Technical field

This application relates to simultaneous interpretation technology, in particular to a data processing method, device, electronic equipment and storage medium.

Background

With the rapid development of artificial intelligence (AI) technology, AI has gradually moved out of the laboratory and is now applied in many aspects of everyday life.

A simultaneous interpretation system is a speech translation product for conference scenarios that has emerged in recent years. It uses AI technology to translate a conference speaker's speech content into multiple languages and to display the resulting text.

In related simultaneous interpretation systems, the presentation document used by the conference speaker must be operated by the speaker or by an assistant, which is inconvenient for the speaker.

Summary of the invention

To solve related technical problems, embodiments of the present application provide a data processing method, device, electronic equipment, and storage medium.

The embodiment of the application provides a data processing method, including:

Acquire voice data, and perform content detection on the voice data;

When a command word contained in the voice data is detected, performing a corresponding operation on a presentation document according to the instruction corresponding to the command word; the content of the presentation document is associated with the content of the voice data; the presentation document is presented while the voice data is played;

When the speech content contained in the voice data is detected, the speech content is presented when the voice data is played.

In the above solution, detecting the command words contained in the voice data includes:

Querying a command word library according to the voice data, and determining the command word in the voice data that meets a first preset condition.

In the above solution, determining the command word in the voice data that meets the first preset condition includes at least one of the following:

Determining a command word in the voice data whose pronunciation similarity with a word in the command word library exceeds a preset threshold;

Determining a command word in the recognized text that matches a word in the command word library; the recognized text is obtained by performing text recognition on the voice data.

In the above solution, the corresponding operation on the presentation document according to the instruction corresponding to the command word includes:

Querying an instruction library according to the command word to determine the target instruction corresponding to the command word; the target instruction represents an operation instruction for the presentation document; the instruction library includes at least one instruction and the command word corresponding to each of the at least one instruction;

Perform corresponding operations on the presentation document according to the target instruction.

In the above solution, detecting the speech content contained in the voice data includes:

Querying the command word library according to the voice data, and determining the speech content in the voice data that does not meet a second preset condition.

In the above solution, determining the speech content in the voice data that does not meet the second preset condition includes at least one of the following:

Determining the speech content in the voice data, where the similarity between the pronunciation of any word in the speech content and the pronunciation of each word in the command word library is lower than a preset threshold;

Determining the speech text in the recognized text, where the matching degree between any word in the speech text and each word in the command word library is lower than a preset matching degree threshold; the recognized text is obtained by performing text recognition on the voice data.

In the above solution, the presenting the speech content when the voice data is played includes:

Determine the recognition result corresponding to the speech content;

Presenting the recognition result when the voice data is played;

Wherein, the recognition result includes at least one of the following: speech text in at least one language, and translated speech data in at least one language.

The embodiment of the present application also provides a simultaneous interpretation device, including:

The acquiring unit is configured to acquire voice data and perform content detection on the voice data;

The first processing unit is configured to, when a command word contained in the voice data is detected, perform a corresponding operation on the presentation document according to the instruction corresponding to the command word; the content of the presentation document is associated with the content of the voice data; the presentation document is presented while the voice data is played;

The second processing unit is configured to, when detecting speech content included in the voice data, present the speech content when the voice data is played.

The embodiment of the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor. When executing the program, the processor implements the steps of any of the foregoing data processing methods.

The embodiment of the present application also provides a storage medium on which computer instructions are stored, and when the instructions are executed by a processor, the steps of any of the foregoing data processing methods are implemented.

With the data processing method, device, electronic equipment, and storage medium provided in the embodiments of the present application, voice data is acquired and content detection is performed on the voice data; when a command word contained in the voice data is detected, a corresponding operation is performed on the presentation document according to the instruction corresponding to the command word; the content of the presentation document is associated with the content of the voice data, and the presentation document is presented while the voice data is played; when speech content contained in the voice data is detected, the speech content is presented while the voice data is played. In this way, the speaker can operate the presentation document simply by speaking, without operating it manually or relying on an assistant, which improves the efficiency of the speech, saves time, and improves the user experience.

Description of the drawings

Figure 1 is a schematic flow diagram of the simultaneous interpretation method in related technologies;

Figure 2 is a schematic flow diagram of a command execution method in the simultaneous interpretation process in related technologies;

FIG. 3 is a schematic flowchart of a data processing method according to an embodiment of the application;

FIG. 4 is a schematic flowchart of another data processing method according to an embodiment of the application;

FIG. 5 is a schematic flowchart of still another data processing method according to an embodiment of the application;

FIG. 6 is a schematic flowchart of a command execution method in a simultaneous interpretation process according to an embodiment of the application;

FIG. 7 is a schematic diagram of the composition structure of a data processing device according to an embodiment of the application;

FIG. 8 is a schematic diagram of the composition structure of an electronic device according to an embodiment of the application.

Detailed description

The application will be further described in detail below in conjunction with the drawings and specific embodiments.

Figure 1 is a schematic flow diagram of the simultaneous interpretation method in related technologies. As shown in Figure 1, after the simultaneous interpretation server is started, the speaker gives a speech using the server. During the speech, the simultaneous interpretation server obtains the speaker's voice data and performs voice recognition on it to obtain recognized text in the same language as the voice data; it then performs machine translation on the recognized text to obtain translated text. The server determines whether the translated text needs to be synthesized into speech. If so, speech is synthesized from the translated text, and the synthesized speech, the recognized text, and the translated text are sent out as the target result; if not, only the recognized text and the translated text are sent out as the target result.
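The related-art flow of Figure 1 can be sketched roughly as follows. This is only an illustration under stated assumptions: the `recognize`, `translate`, and `synthesize` callables are hypothetical stand-ins for speech recognition, machine translation, and speech synthesis engines, none of which the source names, and the `TargetResult` type is introduced here purely for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TargetResult:
    """Hypothetical container for the 'target result' sent out by the server."""
    recognized_text: str
    translated_text: str
    synthesized_speech: Optional[bytes] = None  # set only when synthesis is needed


def interpret(
    voice_data: bytes,
    recognize: Callable[[bytes], str],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    need_speech: bool,
) -> TargetResult:
    """Recognition -> machine translation -> optional speech synthesis."""
    recognized = recognize(voice_data)   # text in the same language as the voice data
    translated = translate(recognized)   # text in the language required by the user
    speech = synthesize(translated) if need_speech else None
    return TargetResult(recognized, translated, speech)


# Toy stand-ins so the sketch runs without any external engine (assumptions).
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_translate(text: str) -> str:
    return f"[translated] {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


result = interpret(
    b"hello everyone", fake_recognize, fake_translate, fake_synthesize, need_speech=False
)
```

With `need_speech=False` the result carries only the recognized and translated text, which corresponds to the branch in which no synthesized speech is sent out.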

Specifically, the simultaneous interpretation server may collect the voice data through an operating terminal; the operating terminal may be a personal computer (PC). The simultaneous interpretation server may send the target result to the operating terminal, which projects it onto the display screen to show the recognized text and the translated text. The simultaneous interpretation server may also send the target result to a terminal, which displays the recognized text and the translated text through its human-computer interaction interface and plays the synthesized speech through its voice playback module. In this way, the target result is shown to the user: the content of the speaker's speech is translated into the language required by the user and displayed. Here, the terminal may be a mobile phone, a tablet computer, etc., and is held by the user.

During the speech, the speaker can also display the presentation document through the operating terminal; specifically, the presentation document is projected onto the display screen and shown to the user. Here, the presentation document may be a PowerPoint (PPT) document, a Word document, etc.

When the speaker needs to play the presentation document, turn a page, go back, or stop playback, the speaker or an assistant must click the corresponding button with a device such as a mouse or a page-turning pen. To do so, the speaker must either pause the speech, operate the presentation document, and then resume, or issue a command to the assistant and wait for the assistant to perform the operation before continuing.

For details, refer to the flowchart shown in Figure 2. As shown in Figure 2, the command execution method in the simultaneous interpretation process is as follows. While giving a speech using the simultaneous interpretation system, the speaker may need to perform an operation on the presentation document. If there is no assistant, the speaker performs the operation (such as turning a page or going back) on the corresponding device (such as the operating terminal described above), so that the device determines and executes the corresponding command. If there is an assistant, the speaker issues a command to the assistant, and the assistant performs the operation on the corresponding device, so that the device determines and executes the corresponding command. After the device has executed the command, the speaker continues the speech using the simultaneous interpretation system. Here, the simultaneous interpretation system may include the foregoing simultaneous interpretation server, operating terminal, display screen, and terminal.

In the related technology, when the speaker operates the corresponding device, the speaker must pause the speech to operate the presentation document, which lengthens the speech and degrades the audience's experience. When there is an assistant, the speaker must issue a command to the assistant, who then operates the device; this also increases the labor cost of the speech.

Based on this, in various embodiments of the present application, voice data is acquired and content detection is performed on the voice data; when a command word contained in the voice data is detected, a corresponding operation is performed on the presentation document according to the instruction corresponding to the command word; the content of the presentation document is associated with the content of the voice data, and the presentation document is presented while the voice data is played; when speech content contained in the voice data is detected, the speech content is presented while the voice data is played. In this way, the speaker can operate the presentation document simply by speaking, without operating it manually or relying on an assistant, which improves the efficiency of the speech, saves time, and improves the user experience.

The embodiment of the application provides a data processing method. FIG. 3 is a schematic flowchart of the data processing method of the embodiment of the application; as shown in FIG. 3, the method includes:

Step 301: Acquire voice data, and perform content detection on the voice data;

Step 302: When the command words contained in the voice data are detected, perform corresponding operations on the presentation document according to the instructions corresponding to the command words;

Here, the content of the presentation document is associated with the content of the voice data; the presentation document is used for presentation when the voice data is played;

Step 303: When the speech content contained in the voice data is detected, present the speech content when the voice data is played.
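Steps 301 to 303 can be sketched as a small dispatch routine. The sketch below is an illustration only and makes several assumptions not stated in the source: the voice data has already been recognized into text, command words are found by simple substring matching, and the `COMMAND_WORDS` table, the instruction names, and the `operate_document` and `present_speech` callbacks are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical command word table: command word -> instruction name.
COMMAND_WORDS: Dict[str, str] = {
    "next page": "NEXT_PAGE",
    "go back": "PREVIOUS_PAGE",
}


def detect_content(recognized_text: str) -> Tuple[List[str], str]:
    """Step 301: split recognized text into command words and remaining speech content."""
    commands: List[str] = []
    speech = recognized_text.lower()
    for word in COMMAND_WORDS:
        if word in speech:
            commands.append(word)
            speech = speech.replace(word, " ")
    # Collapse the whitespace left behind by removed command words.
    return commands, " ".join(speech.split())


def process(
    recognized_text: str,
    operate_document: Callable[[str], None],
    present_speech: Callable[[str], None],
) -> None:
    commands, speech = detect_content(recognized_text)
    for word in commands:
        operate_document(COMMAND_WORDS[word])  # Step 302: operate the presentation document
    if speech:
        present_speech(speech)                 # Step 303: present the speech content


operations: List[str] = []
subtitles: List[str] = []
process("Next page this slide shows our results", operations.append, subtitles.append)
```

Here the command word is routed to the document operation while the rest of the utterance is routed to presentation, so a single utterance can trigger both branches.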

Here, the content of the presentation document being associated with the content of the voice data means that the presentation document is a document related to the content of the speech and shown on the display screen while the speaker is speaking, such as a PowerPoint (PPT) document or a Word document displayed during the speech.

The presentation document is used for presentation when the voice data is played, which means that the presentation document is presented while the voice data is being played. That is, the data processing method can be applied to any meeting scene where a presentation document needs to be displayed, such as a seminar.

The data processing method can be applied to electronic equipment; the electronic equipment can be a server, a terminal, or the like.

In actual application, the electronic device may be a server, and the server projects the presentation document onto the display screen for display. During the speech, the server receives the voice data and executes the data processing method of the embodiment of the present application to perform the corresponding operation on the presentation document.

The electronic device may also be a terminal, with the presentation document projected onto the display screen by the server. During the speech, the terminal receives the voice data, executes the data processing method of the embodiment of the present application, determines the instruction for the presentation document, and sends that instruction to the server; the server then performs the corresponding operation on the presentation document.

When the electronic device is a terminal, the presentation document can also be displayed through the terminal's human-computer interaction interface. During the speech, the terminal receives the voice data and executes the data processing method of the embodiment of the present application to perform the corresponding operation on the presentation document displayed on the human-computer interaction interface.

Here, while the speaker is giving a speech, the voice data can be collected by the operating terminal. The operating terminal (such as a PC) can be equipped with or connected to a voice collection module, such as a microphone. Voice is collected through the voice collection module to obtain the voice data, and the voice data is sent to the electronic device (specifically, the server or terminal described above), which executes the data processing method of the embodiment of the present application.

In actual application, in order to determine whether the user needs to operate the presentation document, the voice data needs to be detected; when it is detected that the voice data contains command words, it can be determined that the user needs to operate the presentation document.

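Based on this, in an embodiment, detecting the command words contained in the voice data includes:

Querying a command word library according to the voice data, and determining the command word in the voice data that meets a first preset condition.

Specifically, determining the command word in the voice data that meets the first preset condition includes at least one of the following:

Determining a command word in the voice data whose pronunciation similarity with a word in the command word library exceeds a preset threshold;

Determining a command word in the recognized text that matches a word in the command word library; the recognized text is obtained by performing text recognition on the voice data.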

Here, in actual application, the voice data can be examined directly to determine the command word (in voice format); alternatively, the voice data can first be recognized to obtain recognized text in the same language as the voice data, and the command word can then be determined in the recognized text (in text format).

Correspondingly, the command word library includes at least one command word, and each command word has a voice-format expression (that is, the pronunciation of the command word) and a text-format expression (that is, the text of the command word).
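A command word library with both expressions, and the two detection routes, might look like the sketch below. It is a minimal illustration under assumptions: the phoneme strings, the threshold value of 0.8, and the use of Python's `difflib.SequenceMatcher` as a stand-in for pronunciation similarity are all choices made for this example, since the source does not specify a similarity measure or a threshold.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass(frozen=True)
class CommandWord:
    text: str           # text-format expression
    pronunciation: str  # voice-format expression, here a toy phoneme string


# Hypothetical library entries; the phoneme strings are illustrative only.
LIBRARY: List[CommandWord] = [
    CommandWord("next page", "N EH K S T P EY JH"),
    CommandWord("go back", "G OW B AE K"),
]

PRESET_THRESHOLD = 0.8  # assumed value


def match_by_pronunciation(phonemes: str) -> Optional[CommandWord]:
    """Return the entry most similar in pronunciation, if it exceeds the threshold."""
    best: Optional[CommandWord] = None
    best_score = 0.0
    for entry in LIBRARY:
        # Compare phoneme sequences token by token.
        score = SequenceMatcher(None, phonemes.split(), entry.pronunciation.split()).ratio()
        if score > best_score:
            best, best_score = entry, score
    return best if best_score > PRESET_THRESHOLD else None


def match_by_text(recognized_text: str) -> Optional[CommandWord]:
    """Return the first entry whose text appears in the recognized text."""
    lowered = recognized_text.lower()
    for entry in LIBRARY:
        if entry.text in lowered:
            return entry
    return None
```

A real system would compare acoustic features or phoneme lattices rather than phoneme strings; the sketch only shows how one library entry can serve both the pronunciation route and the text route.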

In actual application, when the command word determined based on pronunciation (that is, the command word in the voice data whose pronunciation similarity with a word in the command word library exceeds the preset threshold) is the same as the command word determined based on text (that is, the command word in the recognized text that matches a word in the command word library), that command word can be used directly;

The command word determined based on pronunciation may also differ from the command word determined based on text. In this case, the target command word needs to be selected from the command word determined based on pronunciation and the command word determined based on text.

Specifically, the command word determined based on pronunciation is recorded as the first command word; the command word determined based on the text is recorded as the second command word;

When it is determined that the first command word differs from the second command word, a first weight and a second weight are determined; the first weight represents the credibility of a command word determined based on pronunciation, and the second weight represents the credibility of a command word determined based on text. The first command word is weighted according to the first weight, and the second command word is weighted according to the second weight; based on the weighted results, one of the first command word and the second command word is selected as the target command word.

For example, when it is determined that the first command word is command word A and the second command word is also command word A, command word A is directly used as the target command word.

When it is determined that the first command word is command word A and the second command word is command word B, the first weight and the second weight need to be determined; command word A is weighted according to the first weight, and command word B is weighted according to the second weight, and the two weighted results are compared. If the weighted result of command word A is greater than that of command word B, command word A is used as the target command word; otherwise, command word B is used as the target command word.

Here, multiple first command words may be obtained based on pronunciation, and multiple second command words may be obtained based on text; that is, the number of first command words is at least two, and the number of second command words is at least two.

In this case, each of the at least two first command words is weighted according to the first weight to obtain a weighted processing result for each first command word;

Each of the at least two second command words is weighted according to the second weight to obtain a weighted processing result for each second command word;

When it is determined that the at least two first command words and the at least two second command words have no command word in common (that is, every first command word differs from every second command word), the command word with the largest weighted processing result is selected from the at least two first command words and the at least two second command words as the target command word;

When it is determined that the at least two first command words and the at least two second command words have a command word in common (that is, some command word appears both among the first command words and among the second command words), the weighted processing results for that shared command word are added together to obtain a single weighted processing result for each command word; then, according to the weighted processing result of each command word, the command word with the largest weighted processing result is selected from the at least two first command words and the at least two second command words as the target command word.

For example, command word A and command word B are obtained based on pronunciation, where the probability that command word A is the target command word is a1% and the probability that command word B is the target command word is b%; here, a1% + b% may equal 1;

Command word A and command word C are obtained based on the text, where the probability that command word A is the target command word is a2% and the probability that command word C is the target command word is c%; here, a2% + c% may equal 1;

Suppose the first weight is x, the second weight is y, and x+y=1;

The weighted processing results for each command word are as follows:

The weighted processing result of command word A: a1%*x+a2%*y;

The weighted processing result of command word B: b%*x;

The weighted processing result of the command word C: c%*y;

The command word with the largest weighted processing result is selected from the command word A, the command word B, and the command word C as the target command word.
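The weighting example above can be written out as a short routine. This is a sketch only: the function name `select_target_command` and the concrete probabilities and weights in the usage are assumptions chosen for illustration, not values from the source.

```python
from collections import defaultdict
from typing import Dict


def select_target_command(
    by_pronunciation: Dict[str, float],
    by_text: Dict[str, float],
    first_weight: float,
    second_weight: float,
) -> str:
    """Weight both candidate sets, add results for shared command words, pick the largest.

    Raises ValueError if both candidate sets are empty.
    """
    scores: Dict[str, float] = defaultdict(float)
    for word, probability in by_pronunciation.items():
        scores[word] += probability * first_weight   # e.g. a1% * x, b% * x
    for word, probability in by_text.items():
        scores[word] += probability * second_weight  # e.g. a2% * y, c% * y
    return max(scores, key=scores.get)


# Illustrative numbers: a1 = 0.6, b = 0.4, a2 = 0.3, c = 0.7, x = y = 0.5.
target = select_target_command(
    by_pronunciation={"A": 0.6, "B": 0.4},
    by_text={"A": 0.3, "C": 0.7},
    first_weight=0.5,
    second_weight=0.5,
)
```

With these numbers command word A scores 0.6 × 0.5 + 0.3 × 0.5 = 0.45, B scores 0.20, and C scores 0.35, so A is selected even though C has the highest single-route probability, because A is supported by both routes.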

After detecting the command word, the electronic device can determine that the user needs to perform a corresponding operation on the presentation document, and the electronic device can determine the corresponding operation instruction according to the detected command word to realize the control of the presentation document.

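Specifically, performing the corresponding operation on the presentation document according to the instruction corresponding to the command word includes:

Querying an instruction library according to the command word to determine the target instruction corresponding to the command word; the target instruction represents an operation instruction for the presentation document; the instruction library includes at least one instruction and the command word corresponding to each of the at least one instruction;

Performing the corresponding operation on the presentation document according to the target instruction.

The determined target instruction is used to perform the corresponding operation on the presentation document, that is, to control the presentation document.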

For example, when the command word "back" is detected, a first control instruction for the presentation document is determined according to the command word "back"; the first control instruction is used to move the presentation document back (that is, turn to the previous page);

When the command word "page turning" is detected, a second control instruction for the presentation document is determined according to the command word "page turning"; the second control instruction is used to turn the page of the presentation document (that is, turn to the next page).

The first control instruction may be specifically executed by a program that displays a presentation document. For example, if the presentation document is a PPT document, the corresponding program may be a Microsoft Office PowerPoint application;

That is, the Microsoft Office PowerPoint application determines the first control instruction and executes the corresponding back operation.

The second control instruction is similar to the first control instruction. The second control instruction may be executed by the program that displays the presentation document. For example, if the presentation document is a PPT document, the corresponding program may be the Microsoft Office PowerPoint application;

That is, the Microsoft Office PowerPoint application determines the second control instruction, and executes the corresponding page turning operation.

Here, the instruction library may be preset by the developer and stored in the electronic device. The instructions stored in the instruction library and the command words corresponding to the corresponding instructions can be provided to the speaker for viewing in advance.

Specifically, when the data processing method is applied to a server (that is, the electronic device is a server), the instruction library may be preset by the developer and stored in the server.

When the data processing method is applied to a terminal (that is, the electronic device is a terminal), the instruction library may be preset by the developer and saved in the server, and then the set instruction library may be sent to the terminal; accordingly, when After the instruction library is updated, the server may send the updated instruction library to the terminal again, and the terminal receives and saves the corresponding instruction library.

Command words may be provided in at least one language, so that for speakers of different languages the target instruction corresponding to a command word can be obtained by searching the instruction library.

For command words in any language, there can be at least one expression, that is, there can be at least one command word with similar semantics.

For example, the instructions include: turning the page of the presentation document, moving the presentation document back, etc.;

For the "turn the page of the presentation document" instruction, the corresponding command words with similar semantics may include: turn page, turn to the next page, skip to the next page, etc.;

For the "move the presentation document back" instruction, the corresponding command words with similar semantics may include: return to the previous page, return, etc.
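An instruction library in which several command words, possibly in different languages, map to one instruction could be organized as in the sketch below. Everything named here is hypothetical: the `Instruction` enum, the library contents (including the two Chinese command words, added only to illustrate multi-language entries), and the `Presentation` class, which is a toy stand-in for the program that displays the presentation document rather than any real application interface.

```python
from enum import Enum
from typing import Dict, List, Optional


class Instruction(Enum):
    NEXT_PAGE = "next_page"
    PREVIOUS_PAGE = "previous_page"


# Hypothetical instruction library: each instruction lists its command words.
INSTRUCTION_LIBRARY: Dict[Instruction, List[str]] = {
    Instruction.NEXT_PAGE: ["turn page", "turn to the next page", "skip to the next page", "翻页"],
    Instruction.PREVIOUS_PAGE: ["return to the previous page", "return", "后退"],
}

# Reverse index built once, so lookup by command word is a dictionary access.
_COMMAND_TO_INSTRUCTION: Dict[str, Instruction] = {
    word: instruction
    for instruction, words in INSTRUCTION_LIBRARY.items()
    for word in words
}


def find_target_instruction(command_word: str) -> Optional[Instruction]:
    """Query the instruction library for the instruction matching a command word."""
    return _COMMAND_TO_INSTRUCTION.get(command_word.strip().lower())


class Presentation:
    """Toy stand-in for the program that displays the presentation document."""

    def __init__(self, page_count: int) -> None:
        self.page_count = page_count
        self.current_page = 1

    def execute(self, instruction: Instruction) -> None:
        if instruction is Instruction.NEXT_PAGE:
            self.current_page = min(self.current_page + 1, self.page_count)
        elif instruction is Instruction.PREVIOUS_PAGE:
            self.current_page = max(self.current_page - 1, 1)


deck = Presentation(page_count=3)
deck.execute(find_target_instruction("Turn to the next page"))
```

Keeping the library keyed by instruction makes it easy to show speakers the command words available for each operation, while the reverse index keeps detection-time lookup cheap.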

The data processing method of the embodiment of the present application can also be applied to a simultaneous interpretation scenario. In that case, the electronic device may specifically be a device that implements simultaneous interpretation, such as the simultaneous interpretation server used in the method shown in Figure 1, and the voice data may be collected by the operating terminal and sent to the simultaneous interpretation server. The operating terminal may be the aforementioned PC or the like.

Specifically, the simultaneous interpretation scenario may adopt the architecture of a simultaneous interpretation system (which may include the foregoing simultaneous interpretation server, operating terminal, and terminal). The data processing method of the embodiment of the present application can be applied to a device that implements simultaneous interpretation. This device may be a new device added to the architecture of the simultaneous interpretation system, or an existing device in that architecture (such as the simultaneous interpretation server or the terminal) that is improved so that it can implement the method of the embodiment of the present application.

Here, the improvement may consist of adding a command word detection module to the device; the command word detection module is used to detect the voice data and, when a command word is detected, to determine the instruction corresponding to the command word and perform the corresponding operation according to the instruction.

Specifically, in the simultaneous interpretation scenario of a conference, while the speaker is giving a speech, the operating terminal (such as a PC) may be equipped with or connected to a voice collection module, such as a microphone; voice is collected through the voice collection module to obtain the voice data, and the voice data is sent to the device that implements simultaneous interpretation. The operating terminal can also project the presentation document onto the display screen to show it to the user. The device that implements simultaneous interpretation receives the voice data, executes the data processing method of the embodiment of the present application, performs the corresponding operation on the presentation document, and performs simultaneous interpretation on the speech content in the voice data.

When the device that implements simultaneous interpretation is a server, the server receives the voice data sent by the operating terminal, executes the data processing method of the embodiment of the present application, performs the corresponding operation on the presentation document, and performs simultaneous interpretation on the speech content in the voice data.

The device that implements simultaneous interpretation may also be a terminal held by the user. In this case, the operating terminal, or the server that receives the voice data, sends the voice data to the terminal held by the user; that terminal receives the voice data, executes the data processing method of the embodiment of the present application, performs the corresponding operation on the presentation document, and performs simultaneous interpretation on the speech content in the voice data.

In actual applications, in order to achieve simultaneous interpretation, it is necessary to determine the speech content in the voice data, perform text recognition on the speech content, and display the recognition result to the user.

Based on this, in an embodiment, detecting the speech content contained in the voice data includes:

Specifically, determining the speech content in the voice data that does not meet the second preset condition includes at least one of the following:

Determining the speech text in the recognized text; the matching degree of any word in the speech text with each word in the command dictionary is lower than the preset matching degree threshold; the recognized text is obtained by text recognition of the voice data.

Here, the above-mentioned preset threshold and preset matching degree threshold may be preset by the developer and stored in the corresponding device.

When detecting the speech content, the voice data can be examined directly to determine the speech content (in voice format); alternatively, the voice data can first be recognized to obtain the recognized text, and the speech content is then determined in the recognized text (in text format), that is, the speech text in the recognized text is determined.
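The text-format variant can be illustrated with a minimal sketch. It assumes the recognized text has already been split into segments, and it uses the ratio from Python's `difflib.SequenceMatcher` as a stand-in for the matching degree, which the application does not define. The names `COMMAND_DICTIONARY`, `MATCH_THRESHOLD`, `matching_degree`, and `extract_speech_text`, as well as the dictionary entries and the threshold value, are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List

# Hypothetical command dictionary and threshold; the application only states
# that both are preset by the developer.
COMMAND_DICTIONARY = ["next page", "previous page", "first page"]
MATCH_THRESHOLD = 0.8


def matching_degree(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; an assumed stand-in for the matching degree."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def extract_speech_text(segments: List[str]) -> List[str]:
    """Keep the segments whose matching degree with every command is below the threshold."""
    return [
        segment
        for segment in segments
        if all(
            matching_degree(segment, command) < MATCH_THRESHOLD
            for command in COMMAND_DICTIONARY
        )
    ]
```

Segments that survive the filter are treated as speech text and passed on for presentation; segments that reach the threshold for some entry are left to the command word path.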

After the speech content is determined, it needs to be presented in the form required by the user. Specifically, presenting the speech content when the voice data is played includes:

Determining the recognition result corresponding to the speech content;

Presenting the recognition result when the voice data is played;

Wherein, the recognition result includes at least one of the following: recognized text in at least one language, and translated speech data in at least one language.

Here, determining the recognition result corresponding to the speech content includes at least one of the following:

Performing voice recognition on the speech content to obtain a first recognized text; the language corresponding to the first recognized text is the same as the language corresponding to the speech content;

Translating the first recognized text to obtain a second recognized text; the language corresponding to the second recognized text is different from the language corresponding to the speech content;

Performing speech synthesis on the second recognized text to obtain the speech corresponding to the second recognized text, that is, the translated speech data corresponding to the speech content.
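The three operations above chain naturally: recognition feeds translation, and translation feeds synthesis. The following is a minimal sketch of that chain, assuming the recognition, translation, and synthesis engines are supplied as callables. The names `RecognitionResult`, `build_recognition_result`, `recognize`, `translate`, and `synthesize` are hypothetical and do not refer to any particular engine.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RecognitionResult:
    first_recognized_text: str          # same language as the speech content
    second_recognized_text: str         # translated into the target language
    translated_speech: Optional[bytes]  # synthesized audio, or None if not requested


def build_recognition_result(
    speech_audio: bytes,
    recognize: Callable[[bytes], str],
    translate: Callable[[str, str], str],
    synthesize: Callable[[str, str], bytes],
    target_language: str,
    with_speech: bool = True,
) -> RecognitionResult:
    """Run recognition, then translation, then optional speech synthesis."""
    first = recognize(speech_audio)
    second = translate(first, target_language)
    audio = synthesize(second, target_language) if with_speech else None
    return RecognitionResult(first, second, audio)
```

Passing the engines in as parameters keeps the sketch independent of any specific recognition or translation service.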

In actual application, the content specifically included in the recognition result can be selected in advance by the user holding the terminal, and the selected result is sent to the server, so that the server provides the corresponding recognition result according to the user's selection.

Based on this, in an embodiment, when the data processing method is applied to a server, the method further includes:

Receiving a first acquisition request sent by a terminal; the first acquisition request is used to acquire a corresponding recognition result;

According to the first acquisition request, the target recognition result is determined, and the acquired target recognition result is sent to the terminal.

For example, the first acquisition request may be a request for acquiring recognized text; it may also be a request for acquiring translated voice data; it may also be a request for acquiring recognized text and translated voice data.
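As a rough illustration of how a server might answer the first acquisition request, the sketch below returns only the kinds of content the terminal asked for. The function name `handle_first_acquisition_request` and the keys `recognized_text` and `translated_speech` are hypothetical; the application does not specify a request format.

```python
from typing import Dict, Sequence


def handle_first_acquisition_request(
    requested_kinds: Sequence[str],
    recognition_result: Dict[str, object],
) -> Dict[str, object]:
    """Return the target recognition result: only the requested kinds of content."""
    missing = [kind for kind in requested_kinds if kind not in recognition_result]
    if missing:
        # The request names content that the server has not produced.
        raise KeyError("unavailable result kinds: {}".format(missing))
    return {kind: recognition_result[kind] for kind in requested_kinds}
```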

In actual application, in order to provide a recognition result corresponding to a language that meets the needs of the user, the recognition result of the corresponding language can be obtained according to the acquisition request sent by the user through the terminal.

Based on this, in an embodiment, the method may further include:

Receiving a second acquisition request sent by the terminal; the second acquisition request includes at least: the target language;

Acquiring the recognition result corresponding to the target language from the recognition result of at least one language;

The obtained recognition result corresponding to the target language is sent to the terminal.

Here, the terminal may be provided with a human-computer interaction interface. The user holding the terminal can select a language through the human-computer interaction interface. The terminal generates a second acquisition request containing the target language according to the user's selection and sends the second acquisition request to the server, so that the server receives the second acquisition request.

The terminal may be a mobile phone. Most users already carry a mobile phone, so sending the recognition result to the mobile phone requires no additional device to receive and display it, which saves cost and is convenient to operate.
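The handling of the second acquisition request can be sketched as a lookup by target language. This assumes the server keeps its recognition results in a mapping keyed by language code; the function name `handle_second_acquisition_request` and the `target_language` field are hypothetical.

```python
from typing import Dict, Optional


def handle_second_acquisition_request(
    request: Dict[str, str],
    results_by_language: Dict[str, str],
) -> Optional[str]:
    """Pick the recognition result for the requested target language, if one exists."""
    target_language = request["target_language"]
    return results_by_language.get(target_language)
```

Returning `None` for a language that has no result is one possible design choice; a server could instead trigger translation on demand.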

In an embodiment, the data processing method can also be applied to a terminal held by a user. The user holding the terminal can select the language and the content specifically included in the recognition result through the human-computer interaction interface of the terminal. The terminal determines the language selected by the user and the specific content included in the recognition result, and uses the data processing method provided in the embodiment of the present application to obtain and present the recognition result that meets the needs of the user.

In the embodiment of the present application, the data processing method is applied in a simultaneous interpretation scenario. As the speech proceeds, the voice data will continue to change, and the recognition result will also continue to change as the voice data changes.

The data processing method provided in the embodiments of this application can be applied to a variety of voice application scenarios, such as the above-mentioned simultaneous interpretation scenario and a video conference scenario. In the video conference scenario, the command word detection module can likewise be used to determine the command words spoken by the user and, based on the determined command words, perform the corresponding operations on the documents displayed in the video conference.

It should be understood that the order of the steps described in the above embodiments does not mean the order of execution. The order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.

The data processing method provided by the embodiment of the present application acquires voice data and performs content detection on the voice data; when a command word contained in the voice data is detected, a corresponding operation is performed on the presentation document according to the instruction corresponding to the command word, where the content of the presentation document is associated with the content of the voice data and the presentation document is used for presentation when the voice data is played; when speech content contained in the voice data is detected, the speech content is presented when the voice data is played. In this way, the speaker can trigger the corresponding operation simply by speaking, without operating the document manually or relying on an assistant, which improves the efficiency of the speech, saves speech time, and improves the user experience.

FIG. 4 is a schematic flowchart of a keyword detection method according to an embodiment of the application; as shown in FIG. 4, the keyword detection method can be applied to an electronic device (such as the above-mentioned server and a terminal held by a user) and includes:

Step 401: Receive input voice data;

Step 402: Perform command word detection for the input voice data;

Step 403: Determine whether the command word is detected; when it is determined that the command word is detected, go to step 404; when it is determined that the command word is not detected, go to step 405;

Here, the judgment whether the command word is detected includes one of the following:

Judging whether there is a command word in the voice data whose pronunciation similarity to a word in the command dictionary exceeds a preset threshold;

It is determined whether there is a command word matching a word in the command dictionary in the recognized text; the recognized text is obtained by text recognition of the voice data.

Step 404: Determine the instruction corresponding to the command word, and execute the corresponding operation according to the instruction;

Here, the instruction may be an operation instruction for the presentation document, and a corresponding operation is performed on the presentation document according to the determined operation instruction. For details, refer to the method shown in FIG. 3.

It should be noted that in other scenarios, the command words can also be extended based on user requirements. The command words may also cover operations for other programs, such as adjusting the volume of the electronic device (for example, increasing the volume). After a command word for the volume is determined, the corresponding instruction is determined and executed to adjust the volume.

Step 405: Continue to detect the command word.
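Steps 401 to 405 can be sketched as a simple loop. The sketch assumes the text-matching variant of detection, with the voice data already recognized into text segments, and it models the instruction library as a mapping from command words to operations. The `Presentation` class, `INSTRUCTION_LIBRARY`, `detect_command_word`, and `run_keyword_detection` are hypothetical names introduced for illustration only.

```python
from typing import Callable, Dict, Iterable, List, Optional


class Presentation:
    """Toy stand-in for a presentation document with a current page index."""

    def __init__(self, page_count: int) -> None:
        self.page_count = page_count
        self.page = 0

    def next_page(self) -> None:
        self.page = min(self.page + 1, self.page_count - 1)

    def previous_page(self) -> None:
        self.page = max(self.page - 1, 0)


# Hypothetical instruction library: command word -> operation on the document.
INSTRUCTION_LIBRARY: Dict[str, Callable[[Presentation], None]] = {
    "next page": Presentation.next_page,
    "previous page": Presentation.previous_page,
}


def detect_command_word(recognized_text: str) -> Optional[str]:
    """Steps 402-403: return the matched command word, or None if there is none."""
    normalized = recognized_text.strip().lower()
    return normalized if normalized in INSTRUCTION_LIBRARY else None


def run_keyword_detection(segments: Iterable[str], doc: Presentation) -> List[str]:
    """Process recognized-text segments in order; return the executed command words."""
    executed = []
    for text in segments:                    # Step 401: receive input
        command = detect_command_word(text)  # Steps 402-403: detect and decide
        if command is None:
            continue                         # Step 405: continue detecting
        INSTRUCTION_LIBRARY[command](doc)    # Step 404: execute the instruction
        executed.append(command)
    return executed
```

Extending the command set, for example with a volume command, would only require a new entry in the mapping.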

Fig. 5 is a schematic flow diagram of another data processing method according to an embodiment of the application; as shown in Fig. 5, the data processing method is applied to an electronic device (such as the above-mentioned server and a terminal held by a user), and the method includes:

Step 501: Perform voice recognition on voice data;

Here, the voice data is the voice spoken by the speaker during the speech in the simultaneous interpretation scenario.

Step 502: Perform command word detection on the voice data; determine whether the command word is detected, and if it is determined that the command word is detected, then go to step 503; if it is determined that the command word is not detected, then go to step 504;

Here, the performing command word detection on the voice data includes:

Detecting command words in the voice data whose pronunciation similarity with a word in the command dictionary exceeds a preset threshold;

A command word matching a word in the command dictionary in the recognized text is detected; the recognized text is obtained by text recognition of the voice data.

Here, the recognized text corresponding to the voice data can be detected to determine whether the command word is detected, or the voice data can be directly detected during the voice recognition process to determine the command word. For details, refer to the method shown in FIG. 3 for determining the command word in the voice data that meets the first preset condition, which will not be repeated here.

Step 503: Determine the instruction corresponding to the command word, and execute the corresponding operation according to the instruction;

Step 504: Perform machine translation on the recognized text corresponding to the voice data to obtain the translated text;

Here, the recognized text is text obtained by performing voice recognition on voice data.

Step 505: Determine whether speech synthesis is required; if it is determined that speech synthesis is required, go to step 506, and if it is determined not to perform speech synthesis, go to step 507;

Here, whether speech synthesis is required can be preset by the developer and stored in the corresponding device.

Step 506: Perform speech synthesis on the translated text;

Step 507: Output the result of simultaneous interpretation.

Here, the simultaneous interpretation result may include: recognized text and translated text; when it is determined that speech synthesis is required, the simultaneous interpretation result may also include: translated voice data (that is, voice data obtained by performing voice synthesis on the translated text).
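The branching in steps 502 to 507 can be sketched for a single recognized segment as follows. The sketch assumes step 501 has already produced the recognized text, that command detection is an exact match against a mapping of command words to operations, and that translation and synthesis are supplied as callables. The name `process_recognized_text` and the result keys are hypothetical.

```python
from typing import Callable, Dict, Optional


def process_recognized_text(
    recognized_text: str,
    instructions: Dict[str, Callable[[], None]],
    translate: Callable[[str], str],
    synthesize: Optional[Callable[[str], bytes]] = None,
) -> Dict[str, object]:
    """Handle one recognized segment following the branches of FIG. 5."""
    key = recognized_text.strip().lower()
    if key in instructions:                    # Step 502: command word detected
        instructions[key]()                    # Step 503: execute the instruction
        return {"command": key}
    translated = translate(recognized_text)    # Step 504: machine translation
    result: Dict[str, object] = {
        "recognized_text": recognized_text,
        "translated_text": translated,
    }
    if synthesize is not None:                 # Step 505: is synthesis required?
        result["translated_speech"] = synthesize(translated)  # Step 506
    return result                              # Step 507: output the result
```

Passing `synthesize=None` models the preset that disables speech synthesis, in which case the output contains only the recognized and translated text.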

FIG. 6 is a schematic flowchart of a method for executing commands in a simultaneous interpretation process according to an embodiment of the application; as shown in FIG. 6, the method for executing commands in a simultaneous interpretation process includes:

Step 601: The speaker uses the simultaneous interpretation system to give a speech;

Step 602: When the speaker needs to perform a corresponding operation, the speaker speaks the corresponding command word;

Step 603: The simultaneous interpretation system uses the above data processing method to determine the corresponding command word, and executes the instruction corresponding to the command word;

Step 604: The speaker continues to use the simultaneous interpretation system to give the speech.

In order to implement the data processing method of the embodiment of the present application, the embodiment of the present application also provides a data processing device. FIG. 7 is a schematic diagram of the composition structure of a data processing device according to an embodiment of the application; as shown in FIG. 7, the data processing device includes:

The obtaining unit 71 is configured to obtain voice data and perform content detection on the voice data;

The first processing unit 72 is configured to perform corresponding operations on the presentation document according to instructions corresponding to the command words when the command words contained in the voice data are detected;

The second processing unit 73 is configured to, when detecting speech content included in the voice data, present the speech content when the voice data is played.

In an embodiment, the first processing unit 72 is configured to query a command vocabulary according to the voice data, and determine a command word in the voice data that meets a first preset condition.

In an embodiment, the first processing unit 72 is configured to determine a command word that meets a first preset condition in the voice data, including at least one of the following:

In an embodiment, the first processing unit 72 is configured to query an instruction library according to the command word to determine a target instruction corresponding to the command word; the target instruction represents an operation instruction for a presentation document; the instruction library includes at least one instruction and a command word corresponding to each instruction in the at least one instruction;

In an embodiment, the second processing unit 73 is configured to query a command word database according to the voice data, and determine the speech content in the voice data that does not meet the second preset condition.

In an embodiment, the second processing unit 73 is configured to determine the speech content that does not meet the second preset condition in the voice data, including at least one of the following:

In an embodiment, the second processing unit 73 is configured to determine a recognition result corresponding to the speech content;

Presenting the recognition result when the voice data is played;

In practical applications, the acquiring unit 71, the first processing unit 72, and the second processing unit 73 can all be implemented by a processor in an electronic device (such as the aforementioned server and a terminal held by a user), such as a central processing unit (CPU, Central Processing Unit), a digital signal processor (DSP, Digital Signal Processor), a microcontroller unit (MCU, Microcontroller Unit), or a field-programmable gate array (FPGA, Field-Programmable Gate Array).

It should be noted that when the device provided in the above embodiment performs data processing, the division into the above-mentioned program modules is used only as an example. In actual applications, the above-mentioned processing can be allocated to different program modules as needed, that is, the internal structure of the terminal is divided into different program modules to complete all or part of the processing described above. In addition, the device provided in the foregoing embodiment and the data processing method embodiment belong to the same concept; the specific implementation process is detailed in the method embodiment and will not be repeated here.

Based on the hardware implementation of the above device, an embodiment of the application also provides an electronic device. FIG. 8 is a schematic diagram of the hardware composition of the electronic device according to the embodiment of the application. As shown in FIG. 8, the electronic device 80 includes a memory 83, a processor 82, and a computer program stored on the memory 83 and capable of running on the processor 82; when executing the program, the processor 82 located in the electronic device implements the method provided by one or more technical solutions on the electronic device side.

Specifically, when the processor 82 located in the electronic device 80 executes the program, it realizes: acquiring voice data, and performing content detection on the voice data;

It should be noted that the specific steps implemented when the processor 82 in the electronic device 80 executes the program have been described in detail above, and will not be repeated here.

It can be understood that the electronic device further includes a communication interface 81; various components in the electronic device are coupled together through the bus system 84. It can be understood that the bus system 84 is configured to implement connection and communication between these components. In addition to the data bus, the bus system 84 also includes a power bus, a control bus, and a status signal bus.

It can be understood that the memory 83 in the embodiment of the present application may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memory. The non-volatile memory can be a read-only memory (ROM, Read-Only Memory), a programmable read-only memory (PROM, Programmable Read-Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read-Only Memory), a ferromagnetic random access memory (FRAM, Ferromagnetic Random Access Memory), a flash memory (Flash Memory), a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM, Compact Disc Read-Only Memory); the magnetic surface memory can be magnetic disk storage or magnetic tape storage. The volatile memory may be a random access memory (RAM, Random Access Memory), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM, Static Random Access Memory), synchronous static random access memory (SSRAM, Synchronous Static Random Access Memory), dynamic random access memory (DRAM, Dynamic Random Access Memory), synchronous dynamic random access memory (SDRAM, Synchronous Dynamic Random Access Memory), double data rate synchronous dynamic random access memory (DDRSDRAM, Double Data Rate Synchronous Dynamic Random Access Memory), enhanced synchronous dynamic random access memory (ESDRAM, Enhanced Synchronous Dynamic Random Access Memory), synchronous link dynamic random access memory (SLDRAM, SyncLink Dynamic Random Access Memory), and direct Rambus random access memory (DRRAM, Direct Rambus Random Access Memory). The memories described in the embodiments of the present application are intended to include, but are not limited to, these and any other suitable types of memories.

The method disclosed in the foregoing embodiments of the present application may be applied to the processor 82 or implemented by the processor 82. The processor 82 may be an integrated circuit chip with signal processing capability. In the implementation process, the steps of the foregoing method can be completed by an integrated logic circuit of hardware in the processor 82 or instructions in the form of software. The aforementioned processor 82 may be a general-purpose processor, a DSP, or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, and the like. The processor 82 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor or the like. Combining the steps of the method disclosed in the embodiments of the present application, it may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium, and the storage medium is located in a memory. The processor 82 reads the information in the memory and completes the steps of the foregoing method in combination with its hardware.

The embodiment of the present application also provides a storage medium, which is specifically a computer storage medium, and more specifically a computer-readable storage medium. Computer instructions, that is, a computer program, are stored thereon; when the computer instructions are executed by a processor, the method provided by one or more technical solutions on the electronic device side is implemented.

In the several embodiments provided in this application, it should be understood that the disclosed method and smart device can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

The units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units; some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.

In addition, the functional units in the embodiments of the present application may all be integrated into a second processing unit, or each unit may be individually used as a unit, or two or more units may be integrated into one unit; the above-mentioned integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

A person of ordinary skill in the art can understand that all or part of the steps in the above method embodiments can be implemented by a program instructing relevant hardware. The foregoing program can be stored in a computer-readable storage medium; when executed, the program performs the steps of the foregoing method embodiments. The foregoing storage medium includes various media that can store program code, such as a removable storage device, ROM, RAM, a magnetic disk, or an optical disk.

Alternatively, if the aforementioned integrated unit of the present application is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the various embodiments of the present application. The aforementioned storage media include removable storage devices, ROM, RAM, magnetic disks, optical disks, and other media that can store program code.

It should be noted that "first", "second", etc. are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.

In addition, the technical solutions described in the embodiments of the present application can be combined arbitrarily without conflict.

The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed in this application shall be covered by the protection scope of this application.

Claims

A data processing method, including:

Acquiring voice data, and performing content detection on the voice data;

When a command word contained in the voice data is detected, performing a corresponding operation on a presentation document according to the instruction corresponding to the command word; the content of the presentation document is associated with the content of the voice data; the presentation document is used for presentation when the voice data is played;

When speech content contained in the voice data is detected, presenting the speech content when the voice data is played.
The method according to claim 1, wherein detecting the command words contained in the voice data comprises:

Querying a command word library according to the voice data, and determining a command word in the voice data that meets a first preset condition.
The method according to claim 2, wherein the determining the command words in the voice data that meet the first preset condition includes at least one of the following:

Determining a command word in the voice data whose pronunciation similarity with a word in the command word library exceeds a preset threshold;

Determining a command word in the recognized text that matches a word in the command word library; the recognized text is obtained by text recognition of the voice data.
The method according to any one of claims 1 to 3, wherein the corresponding operation on the presentation document according to the instruction corresponding to the command word comprises:

Querying an instruction library according to the command word to determine a target instruction corresponding to the command word; the target instruction represents an operation instruction for the presentation document; the instruction library includes at least one instruction and a command word corresponding to each instruction in the at least one instruction;

Performing a corresponding operation on the presentation document according to the target instruction.
The method according to claim 1, wherein detecting the speech content contained in the voice data comprises:

Querying a command word library according to the voice data, and determining the speech content in the voice data that does not meet a second preset condition.
The method according to claim 5, wherein the determining the speech content in the voice data that does not meet the second preset condition comprises at least one of the following:

Determining the speech content in the voice data; the similarity between the pronunciation of any word in the speech content and the pronunciation of each word in the command word library is lower than a preset threshold;

Determining the speech text in the recognized text; the matching degree of any word in the speech text with each word in the command word library is lower than a preset matching degree threshold; the recognized text is obtained by text recognition of the voice data.
The method according to claim 1, 5, or 6, wherein the presenting the speech content when the voice data is played includes:

Determine the recognition result corresponding to the speech content;

Presenting the recognition result when the voice data is played;

Wherein, the recognition result includes at least one of the following: speech text in at least one language, and translated speech data in at least one language.
A simultaneous interpretation device, including:

The acquiring unit is configured to acquire voice data and perform content detection on the voice data;

The first processing unit is configured to, when a command word contained in the voice data is detected, perform a corresponding operation on the presentation document according to the instruction corresponding to the command word; the content of the presentation document is associated with the content of the voice data; the presentation document is used for presentation when the voice data is played;

The second processing unit is configured to, when detecting speech content included in the voice data, present the speech content when the voice data is played.
An electronic device comprising a memory, a processor, and a computer program stored on the memory and capable of running on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 7 when executing the program.
A storage medium having computer instructions stored thereon, and when the instructions are executed by a processor, the steps of the method according to any one of claims 1 to 7 are realized.