CN111062221A - Data processing method, data processing device, electronic equipment and storage medium

Publication number: CN111062221A
Application number: CN201911283529.8A
Authority: CN (China)
Legal status: Pending
Original language: Chinese (zh)
Inventor: 郝杰
Applicant/Assignee: Beijing Opper Communication Co Ltd
Prior art keywords: text, word, determining, presentation format, recognition

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 - Speech to text systems


Abstract

The invention discloses a data processing method, a data processing device, an electronic device, and a storage medium. The method includes: acquiring voice data to be processed and performing text recognition on the voice data to obtain a recognized text, where the recognized text is presented while the voice data is played; performing word segmentation on the recognized text to obtain at least one word; determining keywords based on the frequency of occurrence of the obtained at least one word in the recognized text, and determining a target segment in the recognized text using the determined keywords; and determining a first presentation format for the target segment, so that the target segment is presented in the first presentation format when the recognized text is presented. The first presentation format is different from the second presentation format, where the second presentation format is the presentation format of the characters in the recognized text other than the target segment.

Description

Data processing method, data processing device, electronic equipment and storage medium
Technical Field
The present invention relates to simultaneous interpretation technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.
Background
A machine simultaneous interpretation system uses Automatic Speech Recognition (ASR) technology to automatically recognize the speech of a speaker at a simultaneous interpretation conference, converting the speaker's utterances from speech data into text data; it then translates the text data using Machine Translation (MT) technology, converting the speaker's utterances into text in a target language, and displays the translation result to the conference participants. With the remarkable progress of automatic speech recognition and machine translation, machine simultaneous interpretation systems have reached a practical stage, and the gap between machine and human simultaneous interpretation keeps narrowing.
However, in the related art, because a speaker at a simultaneous interpretation conference generally speaks quickly, the machine simultaneous interpretation system also switches the subtitles displayed to the participants frequently; sometimes the display has already switched to the next screen of subtitles before the participants have finished reading the current one, which seriously hinders the participants' understanding of the speaker's content.
Disclosure of Invention
Embodiments of the invention provide a data processing method, a data processing device, an electronic device, and a storage medium.
An embodiment of the invention provides a data processing method, which includes:
acquiring voice data to be processed, and performing text recognition on the voice data to obtain a recognized text, where the recognized text is presented while the voice data is played;
performing word segmentation on the recognized text to obtain at least one word; determining keywords based on the frequency of occurrence of the obtained at least one word in the recognized text, and determining a target segment in the recognized text using the determined keywords;
determining a first presentation format for the target segment, so that the target segment is presented in the first presentation format when the recognized text is presented; the first presentation format is different from the second presentation format, where the second presentation format is the presentation format of the characters in the recognized text other than the target segment.
In the above solution, determining keywords based on the frequency of occurrence of the obtained at least one word in the recognized text includes:
filtering the obtained at least one word to obtain a filtered word segmentation result;
and determining keywords based on the frequency of occurrence, in the recognized text, of each word in the filtered word segmentation result.
In the above solution, determining keywords based on the frequency of occurrence of the obtained at least one word in the recognized text includes:
for each word, determining the frequency of occurrence of the word in the recognized text; and when that frequency satisfies a first preset condition, determining the word as a keyword.
In the above solution, determining the frequency of occurrence of the corresponding word in the recognized text includes one of:
determining the frequency of occurrence of the corresponding word in the word segmentation result corresponding to the recognized text;
determining the frequency of occurrence of the corresponding word in a first information base, where the first information base stores the historical word segmentation results of the current simultaneous interpretation process;
determining the frequency of occurrence of the corresponding word in a second information base, where the second information base includes the first information base and the word segmentation result corresponding to the recognized text.
In the above scheme, when determining the keywords, the method further includes:
weighting the frequency of occurrence of the corresponding word in the recognized text and the probability that the corresponding word is associated with the technical field corresponding to the voice data to be processed, to obtain a weighted result; and when the weighted result satisfies the first preset condition, determining the corresponding word as a keyword.
In the above solution, determining the target segment in the recognized text using the determined keywords includes one of:
determining the keywords as the target segment;
dividing the recognized text into at least one text segment, and determining the text segments containing the keywords as the target segment.
In the foregoing solution, segmenting the recognized text includes:
segmenting the recognized text using a preset word segmentation model.
In the above solution, determining keywords based on the frequency of occurrence of the obtained at least one word in the recognized text includes:
determining keywords using the frequency of occurrence of the obtained at least one word in the recognized text when the current simultaneous interpretation process satisfies a second preset condition;
or,
when the current simultaneous interpretation process does not satisfy the second preset condition, determining keywords based on the frequency of occurrence of the obtained at least one word in the recognized text in combination with a third information base, where the third information base stores technical terms associated with the technical field corresponding to the voice data to be processed.
In the foregoing solution, when determining the first presentation format of the target segment, the method further includes:
for a first keyword among the determined keywords, determining the presentation format corresponding to the frequency of occurrence of the first keyword in the recognized text, using the correspondence between word frequency and presentation format;
and using the determined presentation format as the presentation format of the text corresponding to the first keyword in the target segment.
An embodiment of the present invention further provides a data processing apparatus, including:
an acquisition unit, configured to acquire voice data to be processed and perform text recognition on the voice data to obtain a recognized text, where the recognized text is presented while the voice data is played;
a first processing unit, configured to segment the recognized text to obtain at least one word, determine keywords based on the frequency of occurrence of the obtained at least one word in the recognized text, and determine a target segment in the recognized text using the determined keywords;
a second processing unit, configured to determine a first presentation format for the target segment, so that the target segment is presented in the first presentation format when the recognized text is presented, where the first presentation format is different from the second presentation format, and the second presentation format is the presentation format of the characters in the recognized text other than the target segment.
An embodiment of the present invention further provides an electronic device, including: a processor and a memory for storing a computer program capable of running on the processor;
wherein the processor is configured to perform the steps of any of the above methods when running the computer program.
An embodiment of the present invention further provides a storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of any one of the above methods are implemented.
According to the data processing method and device, the electronic device, and the storage medium provided by the embodiments of the invention, voice data to be processed is acquired and text recognition is performed on it to obtain a recognized text, where the recognized text is presented while the voice data is played; word segmentation is performed on the recognized text to obtain at least one word; keywords are determined based on the frequency of occurrence of the obtained at least one word in the recognized text, and a target segment in the recognized text is determined using the determined keywords; and a first presentation format is determined for the target segment, so that the target segment is presented in that format when the recognized text is presented, the first presentation format being different from the second presentation format, which is the presentation format of the characters in the recognized text other than the target segment. In the solution of the embodiments of the invention, because the presentation format of the target segment differs from that of the rest of the recognized text, the key information of the speaker's speech is extracted and presented with emphasis when the interpretation data is displayed to the conference participants, so that the participants can grasp the key information of the speech and better understand the speaker's content.
Drawings
FIG. 1 is a schematic diagram of a machine simultaneous interpretation system in the related art;
FIG. 2 is a first flowchart illustrating a data processing method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a process of determining a target segment according to an embodiment of the present invention;
FIG. 4 is a second flowchart illustrating a data processing method according to an embodiment of the present invention;
FIG. 5 is a flowchart of a data processing method according to an application embodiment of the present invention;
FIG. 6 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solution of the invention is further elaborated below with reference to the drawings and the embodiments in the specification.
Before the technical solutions of the embodiments of the present invention are explained in detail, the machine simultaneous interpretation system in the related art is briefly described; hereinafter, it is referred to simply as the machine interpretation system.
FIG. 1 is a schematic diagram of a machine simultaneous interpretation system in the related art. As shown in fig. 1, the system may include: a machine interpretation server, a speech processing server, terminals held by users, an operation terminal, and a display screen. A terminal held by a user may be a mobile phone, a tablet computer, or the like; the operation terminal may be a Personal Computer (PC), a mobile phone, or the like, where the PC may be a desktop computer, a notebook computer, a tablet computer, or the like.
In practical application, a speaker at the simultaneous interpretation conference speaks through the operation terminal. During the speech, the operation terminal collects the speaker's voice data and sends it to the machine interpretation server, which recognizes the voice data through the speech processing server to obtain a recognized text (the recognized text may be in the same language as the voice data, or in another language obtained by translating the first recognition result). The machine interpretation server may send the recognized text to the operation terminal, which projects it onto the display screen; it may also send the recognized text to the terminals held by users (specifically, the recognition result in whichever language each participant needs), so that the speaker's content is translated into the language required by each participant and displayed. The speech processing server may include: a speech recognition module (i.e., the speech recognition system), a text smoothing module, and a machine translation module. The speech recognition module performs text recognition on the user's voice data to obtain a recognized text; the text smoothing module performs format processing on the recognized text, for example spoken-language smoothing, punctuation recovery, inverse text normalization, and the like; and the machine translation module translates the format-processed recognized text into text in another language, thereby obtaining a translated text.
Here, it should be noted that fig. 1 only illustrates one structure of the machine simultaneous interpretation system; in practical application, the system may also be implemented on a single mobile device.
However, during a simultaneous interpretation conference the speaker may speak quickly, so when the display screen or a user's terminal displays the speaker's content to the participants, the displayed data is also switched quickly; the display may already have switched to the next screen of data before a participant has finished reading the current one. Therefore, how to enable participants to grasp the key information of the speech more quickly and understand the speaker's content more effectively is a problem to be solved urgently.
In one embodiment, text recognition is performed on the voice data to be processed to obtain a recognized text; word segmentation is performed on the recognized text to obtain at least one word; keywords are determined based on the frequency of occurrence of the obtained at least one word in the recognized text; the segments of the recognized text that need to be highlighted are determined using the determined keywords; and the presentation format of those segments is determined, so that it is distinguished from the presentation format of the other text in the recognized text. In this way, key information can be extracted from the speech and presented with emphasis when the interpretation data is displayed to the conference participants, so that the participants can grasp the key information of the speech and better understand the speaker's content.
The embodiment of the invention provides a data processing method, which is applied to an electronic device; as shown in fig. 2, the method includes the following steps:
Step 201: acquiring voice data to be processed, and performing text recognition on the voice data to obtain a recognized text;
Here, the recognized text is presented while the voice data is played.
Step 202: performing word segmentation on the recognized text to obtain at least one word; determining keywords based on the frequency of occurrence of the at least one word in the recognized text, and determining a target segment in the recognized text using the determined keywords.
Step 203: determining a first presentation format for the target segment, so that the target segment is presented in the first presentation format when the recognized text is presented;
Here, the first presentation format is different from the second presentation format; the second presentation format is the presentation format of the characters in the recognized text other than the target segment.
Specifically, in step 201, saying that the recognized text is used for presentation when the voice data is played means that the recognized text is presented while the voice data is played. That is, the data processing method provided in the embodiment of the present invention can be applied to a simultaneous interpretation scenario. The scenario may adopt the system structure shown in fig. 1, and the electronic device may be a device newly added to that structure, or an improvement of a device already in it, so as to implement the method of the embodiment of the present invention.
It should be noted that, in practical application, in a simultaneous interpretation scenario, the voice data changes continuously as the speaker speaks, and the recognized text therefore changes continuously with the voice data.
In practical application, the electronic device may be a terminal or a server. When the electronic device is a terminal, the terminal may acquire the voice data to be processed through a voice acquisition module (such as a microphone) provided on the terminal or one in communication connection with it. When the electronic device is a server, a terminal may collect the voice data to be processed and the server then obtains it from the terminal; alternatively, the server may directly collect the voice data through a voice acquisition module of its own or one in communication connection with it.
For example, in a simultaneous interpretation scenario, when a speaker speaks, a first terminal (e.g., the operation terminal shown in fig. 1) may capture the speech content in real time through a voice acquisition module to obtain the voice data to be processed. The first terminal establishes a communication connection with the server implementing simultaneous interpretation and sends the collected voice data to it; the server thus obtains the voice data to be processed in real time and performs text recognition on it to obtain the recognized text for presentation, that is, the recognized text is presented while the voice data is played.
In step 201, in actual application, the recognized text obtained from the voice data may correspond to one or more languages, and recognized texts in different languages are displayed to participants who use different languages.
Here, the recognized text corresponds to at least one language: it may be a recognized text in the same language as the voice data to be processed (denoted as the first language), or a recognized text in another language obtained by translating the first-language text, specifically a recognized text in a second language, ..., or in an Nth language, where N is greater than or equal to 1.
In practical application, when the recognized text is in the same language as the voice data to be processed, performing text recognition on the voice data to obtain the recognized text includes:
performing speech recognition on the voice data to obtain a recognized text in a first language, the first language being the same as the language of the voice data.
When the recognized text is in a language different from that of the voice data to be processed, performing text recognition on the voice data to obtain the recognized text includes:
performing speech recognition on the voice data to obtain a recognized text in the first language, the first language being the same as the language of the voice data;
and performing machine translation on the recognized text in the first language using a preset translation model to obtain recognized texts in other languages.
Text recognition is performed on the voice data in the above manner, and the obtained recognized text corresponds to at least one language; that is, a recognized text in the first language, a recognized text in a second language, ..., and a recognized text in an Nth language can be obtained from the voice data, where N is greater than or equal to 1.
Here, the translation model is used to translate text in one language into text in another language. In practical applications, the translation model may be trained using machine learning techniques (such as neural network techniques).
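As an illustration only, the recognize-then-translate flow described above can be sketched as follows; recognize and translate are hypothetical stand-ins for the ASR system and the preset translation model, neither of which the patent specifies (Python is used for all sketches in this description).

from typing import Callable, Dict, List

def recognized_texts(audio: bytes,
                     target_languages: List[str],
                     recognize: Callable[[bytes], str],
                     translate: Callable[[str, str], str]) -> Dict[str, str]:
    # Speech recognition yields the first-language text; machine translation
    # then yields a recognized text for each further language (N >= 1).
    first_language_text = recognize(audio)
    texts = {"first": first_language_text}
    for lang in target_languages:
        texts[lang] = translate(first_language_text, lang)
    return texts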
In practical application, when speech recognition is performed on the voice data, the recognized text may have problems such as missing punctuation marks, wrong punctuation marks, and unsmooth sentences; therefore, the text recognized from speech can be tidied using a preset text normalization model, and the text output by the preset text normalization model is used as the recognized text in the first language.
The preset text normalization model tidies the input text and outputs the tidied text; tidying the input text includes at least one of: adding, modifying, or deleting punctuation marks in the input text; and adjusting the word order of the input text. In practical applications, the preset text normalization model may be trained using machine learning techniques (such as neural network techniques).
In practical application, after obtaining the recognized text, the electronic device may present it directly on a display screen of its own or one in communication connection with it, or may send it to a second terminal (e.g., the operation terminal shown in fig. 1 or a terminal held by a user), which presents the recognized text to a conference participant while the voice data is played, so that the participant can read the recognized text to follow the content of the voice data. The second terminal may have a preset target translation language, or may offer a language selection service through a human-computer interaction interface of its own or one in communication connection with it, on which the participant selects the target translation language he or she needs. After determining the target translation language, the second terminal may send the electronic device a text recognition request message containing the target translation language, and the electronic device sends the recognized text in that language to the second terminal for presentation. Of course, the target translation language may also be preset in the electronic device, or a language selection service may be provided through a human-computer interaction interface of the electronic device or one in communication connection with it; after the target translation language is determined, the recognized text in that language is presented directly and is also sent to the second terminal for presentation.
For step 202, in an embodiment, segmenting the recognized text may include:
segmenting the recognized text using a preset word segmentation model.
Here, the preset word segmentation model performs word segmentation on the input text and outputs the at least one word obtained by segmentation. In practical applications, the preset word segmentation model may be trained using machine learning techniques (such as neural network techniques).
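As an illustrative sketch of this step, the open-source jieba tokenizer is used below as a stand-in for the preset word segmentation model, which the patent leaves unspecified; this choice is an assumption, not part of the invention.

import jieba  # stand-in tokenizer, assumed for illustration only

def segment(recognized_text):
    # Perform word segmentation on the recognized text, returning a word list.
    return jieba.lcut(recognized_text)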
With respect to step 202, in an embodiment, as shown in fig. 3, determining keywords based on the frequency of occurrence of the obtained at least one word in the recognized text may include the following steps:
Step 2021: filtering the obtained at least one word to obtain a filtered word segmentation result;
Step 2022: determining keywords based on the frequency of occurrence, in the recognized text, of each word in the filtered word segmentation result.
For step 2021, filtering the obtained at least one word may include:
filtering the obtained at least one word using a preset filtering model.
Specifically, the preset filtering model filters the input words, removing words that are frequently used in daily speech but carry little actual information, such as common words (e.g., greetings and discourse words like 'classmates', 'hello', 'thank you', 'start', and 'end') and stop words (function words that carry little content on their own), and outputs the remaining words, that is, the filtered word segmentation result. In practical applications, the preset filtering model may be trained using machine learning techniques (such as neural network techniques).
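A minimal sketch of the filtering step follows; the patent describes a trained filtering model, so the fixed stop-word set here, and the words in it, are assumptions used only for illustration.

# Stand-in for the preset filtering model: drop common words and stop words
# that carry little information. The word set is an illustrative assumption.
COMMON_OR_STOP_WORDS = {"classmates", "hello", "thank you", "start", "end"}

def filter_words(words):
    # Return the filtered word segmentation result.
    return [w for w in words if w not in COMMON_OR_STOP_WORDS]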
In practical application, at stages of a simultaneous interpretation conference such as the opening and the closing remarks, the words in the voice data to be processed may all be words that are frequently used in daily speech but carry little actual information; that is, when the recognized text is segmented into at least one word and that at least one word is filtered, everything may be filtered out, so no filtered word segmentation result is obtained. In this case there is no part of the recognized text that needs emphasis, and the recognized text may be presented directly in a preset basic presentation format.
Based on this, in an embodiment, the method may further include: when the at least one word is filtered and all of it is filtered out, determining the presentation format of the recognized text to be the preset presentation format.
Of course, when the at least one word is not entirely filtered out, all the words contained in the filtered word segmentation result may be determined as the target segment. In practical applications, the target segment may be at least one word, sentence, or paragraph.
In practical application, the word segmentation result (whether unfiltered or filtered) may contain a large number of words; if all of them were determined as target segments, too much of the recognized text would be emphasized, and the participants could not grasp the key information of the speech intuitively. In that case, the frequency of occurrence in the recognized text of each word in the word segmentation result can be determined, the words whose frequencies satisfy a preset condition are determined as keywords, and the target segments are then determined using those keywords. This further trims the text to be emphasized, so that participants can grasp the key information of the speech more intuitively and better understand the speaker's content.
Based on this, for step 2022, in an embodiment, determining keywords based on the frequency of occurrence of the obtained at least one word in the recognized text may include:
for each word, determining the frequency of occurrence of the word in the recognized text; and when that frequency satisfies a first preset condition, determining the word as a keyword.
Here, each word may be a word contained in the unfiltered word segmentation result, or a word contained in the filtered word segmentation result.
In practice, determining the frequency of occurrence of the corresponding word in the recognized text may include one of the following:
determining the frequency of occurrence of the corresponding word in the word segmentation result corresponding to the recognized text;
determining the frequency of occurrence of the corresponding word in a first information base, where the first information base stores the historical word segmentation results of the current simultaneous interpretation process;
determining the frequency of occurrence of the corresponding word in a second information base, where the second information base includes the first information base and the word segmentation result corresponding to the recognized text.
Here, the word segmentation result may be an unfiltered word segmentation result or a filtered one.
In practical application, any one of the above ways can be selected as needed to determine the frequency of occurrence of the corresponding word in the recognized text. The frequency may take either of two forms: the number of occurrences of the corresponding word may be counted and used directly as its frequency of occurrence; or the number of occurrences may be divided by the total number of words, and the resulting value used as the frequency of occurrence. For example, a first preset threshold (e.g., 200) may be set as needed. When the total number of words in the word segmentation result corresponding to the recognized text is greater than or equal to the first preset threshold, the frequency is determined within that word segmentation result: either the number of occurrences of the word there is used directly, or that number is divided by the total number of words in the result. When the total number of words in the word segmentation result is smaller than the first preset threshold, the historical word segmentation results of the current simultaneous interpretation process are used instead: the frequency of the word is determined in the first information base (its number of occurrences there, or that number divided by the total number of words in the first information base), or in the second information base (likewise, its number of occurrences there, or that number divided by the total number of words in the second information base).
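The counting logic above can be sketched as follows; the 200-word cut-off and the ratio form of the frequency are the example values from the text, and representing the second information base as history plus the current result follows the definition given earlier.

from collections import Counter

FIRST_PRESET_THRESHOLD = 200  # example value from the text

def occurrence_frequency(word, current_words, history_words):
    # Use the current word segmentation result when it is large enough;
    # otherwise fall back to the second information base (history + current).
    corpus = (current_words if len(current_words) >= FIRST_PRESET_THRESHOLD
              else history_words + current_words)
    return Counter(corpus)[word] / max(len(corpus), 1)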
In practical application, the historical word segmentation results of the current simultaneous interpretation process can be stored in a local database or cache of the server, or in the cloud.
Based on this, in an embodiment, the method further includes:
acquiring the first information base from local storage, a cache, or the cloud.
In practical application, the first information base is updated continuously during the current simultaneous interpretation process: each time a word segmentation result (unfiltered or filtered) is determined, it can be added to the first information base to obtain the second information base, and the first information base is updated with the second information base before the next voice data to be processed is obtained.
In practical application, the first preset condition can be set as needed. For example, the first preset condition may be a preset threshold (denoted as the second preset threshold), which can also be set as needed (for example, 50 when the raw number of occurrences is used as the frequency, or 25% when the number of occurrences divided by the total number of words is used as the frequency). When the frequency of occurrence of the corresponding word in the recognized text is greater than or equal to the second preset threshold, the frequency is regarded as satisfying the first preset condition; when it is smaller than the second preset threshold, it is regarded as not satisfying the first preset condition.
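A sketch of the first preset condition follows, using the example thresholds above (50 for raw counts, 25% for ratios); which form applies is a configuration choice, not something fixed by the patent.

from collections import Counter

COUNT_THRESHOLD = 50    # example second preset threshold, raw-count form
RATIO_THRESHOLD = 0.25  # example second preset threshold, ratio form

def select_keywords(words, use_ratio=True):
    # Return the words whose frequency of occurrence satisfies the condition.
    counts = Counter(words)
    total = max(len(words), 1)
    if use_ratio:
        return {w for w, c in counts.items() if c / total >= RATIO_THRESHOLD}
    return {w for w, c in counts.items() if c >= COUNT_THRESHOLD}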
In practical application, a word may occur frequently in the recognized text and yet be unrelated to the technical field of the voice data to be processed, that is, off-topic for the current simultaneous interpretation conference and not worth presenting to the participants with emphasis. Therefore, when determining keywords, the probability that the corresponding word is associated with the technical field of the voice data to be processed can also be determined; the frequency of occurrence of the word in the recognized text and this probability are weighted, and the word is determined to be a keyword when the weighted result satisfies the first preset condition. In this way, keywords can be determined more accurately, helping the participants grasp the key information of the speech and better understand the speaker's content.
Based on this, in an embodiment, when determining the keywords, the method may further include:
weighting the frequency of occurrence of the corresponding word in the recognized text and the probability that the corresponding word is associated with the technical field corresponding to the voice data to be processed, to obtain a weighted result; and when the weighted result satisfies the first preset condition, determining the corresponding word as a keyword.
Here, it should be noted that, when the weighting is performed, the frequency of occurrence of the corresponding word in the recognized text is determined in one of the following ways:
determining the number of occurrences of the corresponding word in the word segmentation result corresponding to the recognized text divided by the total number of words in that word segmentation result;
determining the number of occurrences of the corresponding word in the first information base divided by the total number of words in the first information base;
determining the number of occurrences of the corresponding word in the second information base divided by the total number of words in the second information base.
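The weighting step might look like the following sketch; the patent fixes neither the weights nor the source of the domain-association probability, so all three constants below are assumptions.

FREQ_WEIGHT = 0.6      # assumed weight for the frequency of occurrence
DOMAIN_WEIGHT = 0.4    # assumed weight for the domain-association probability
SCORE_THRESHOLD = 0.3  # assumed first preset condition on the weighted result

def is_keyword(frequency, domain_probability):
    # Weight frequency against domain relevance, then threshold the result.
    score = FREQ_WEIGHT * frequency + DOMAIN_WEIGHT * domain_probability
    return score >= SCORE_THRESHOLD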
In actual application, after a plurality of keywords are determined, the determined keywords themselves can be directly used as the target segments. Alternatively, the recognized text can be divided into at least one text segment, and each text segment containing at least one of the keywords is determined as a target segment. A third preset threshold can also be set: for each of the at least one text segment, the segment is determined as a target segment when the number of keywords it contains is greater than or equal to the third preset threshold. In this way, when determining the text to be highlighted, the strength of the highlighting of the speech's key information can be chosen as needed, helping the participants better understand the speaker's content.
Based on this, in an embodiment, determining the target segment in the recognized text using the determined keywords may include one of (see the sketch after this list):
determining the keywords as the target segment;
dividing the recognized text into at least one text segment, and determining the text segments containing the keywords as the target segments.
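A sketch of the second strategy follows: the recognized text is split into sentences, and sentences containing enough keywords are kept. Splitting on sentence-ending punctuation and a third preset threshold of 1 are assumptions made for illustration.

import re

THIRD_PRESET_THRESHOLD = 1  # assumed: at least one keyword per target segment

def target_segments(recognized_text, keywords):
    # Divide the text into segments, then keep the keyword-bearing ones.
    segments = [s for s in re.split(r"(?<=[。！？.!?])", recognized_text) if s]
    return [s for s in segments
            if sum(1 for k in keywords if k in s) >= THIRD_PRESET_THRESHOLD]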
For step 203, in actual application, the first presentation format and the second presentation format may each include at least one of the following:
font;
font size;
font color.
The font may include formats such as bold or regular, italic or upright, and underlined or not.
In practical application, in order to further highlight the text in the target segment corresponding to keywords with higher frequencies of occurrence, a correspondence between word frequency and presentation format may be preset; when the first presentation format of the target segment is determined, it is determined using this preset correspondence between word frequency and presentation format. In this way, the key information of the speech can be further highlighted, and the participants can better understand the speaker's content.
Based on this, in an embodiment, when determining the first presentation format of the target segment, the method may further include:
for a first keyword among the determined keywords, determining the presentation format corresponding to the frequency of occurrence of the first keyword in the recognized text, using the correspondence between word frequency and presentation format;
and using the determined presentation format as the presentation format of the text corresponding to the first keyword in the target segment.
Here, it should be noted that the presentation formats of the texts in the target segment may be the same or may differ.
In practical application, the text corresponding to the first keyword in the target segment may be the same text as the first keyword, or a text segment containing the first keyword. When it is a text segment containing the first keyword (denoted as the first text segment), the first text segment may also contain other determined keywords besides the first keyword. In that case, the frequency of occurrence in the recognized text of each keyword contained in the first text segment may be determined, yielding at least one frequency of occurrence. Then either the average of these frequencies is computed, and the presentation format corresponding to the average frequency (per the correspondence between word frequency and presentation format) is determined as the presentation format of the first text segment; or the maximum of these frequencies is taken, and the presentation format corresponding to the maximum frequency is determined as the presentation format of the first text segment.
In practical application, the correspondence between word frequency and presentation format may be set as needed; the higher a word's frequency of occurrence, the more prominent the corresponding presentation format should be (that is, the greater its difference from the second presentation format).
For example, each frequency value may be assigned its own presentation format. Suppose the second presentation format is 'size-4 characters, regular script, black'; the presentation format for a frequency of 25% may then be 'size-3 characters, regular script, black', and that for 26% 'size-3 characters, regular script, bold, black'. When the frequency of occurrence of the first keyword in the recognized text is 25%, the presentation format of the text corresponding to the first keyword in the target segment is determined to be 'size-3 characters, regular script, black'; when it is 26%, the format is determined to be 'size-3 characters, regular script, bold, black'.
For another example, to increase the computing speed of the electronic device and reduce the simultaneous interpretation delay, at least one frequency range may be set, together with a presentation format for each range. Suppose again the second presentation format is 'size-4 characters, regular script, black', and three frequency ranges are set: 25%-50%, 50%-75%, and 75%-100%, whose presentation formats are set to 'size-3 characters, regular script, black', 'size-3 characters, regular script, bold, black', and 'size-2 characters, Song typeface, bold, red', respectively. When the frequency of occurrence of the first keyword in the recognized text falls in the 25%-50% range, the presentation format of the corresponding text in the target segment is determined to be 'size-3 characters, regular script, black'; in the 50%-75% range, 'size-3 characters, regular script, bold, black'; and in the 75%-100% range, 'size-2 characters, Song typeface, bold, red'.
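The range-based mapping in this example can be coded as follows; representing a presentation format as a dictionary of size, typeface, weight, and color attributes is an assumption about data layout, not something the patent prescribes.

SECOND_FORMAT = {"size": 4, "typeface": "regular script", "bold": False, "color": "black"}

FORMAT_RANGES = [  # (low, high, format), per the example above
    (0.25, 0.50, {"size": 3, "typeface": "regular script", "bold": False, "color": "black"}),
    (0.50, 0.75, {"size": 3, "typeface": "regular script", "bold": True, "color": "black"}),
    (0.75, 1.00, {"size": 2, "typeface": "Song", "bold": True, "color": "red"}),
]

def first_presentation_format(frequency):
    # Map a keyword's frequency of occurrence to its presentation format;
    # words outside all ranges keep the second (default) presentation format.
    for low, high, fmt in FORMAT_RANGES:
        if low <= frequency <= high:
            return fmt
    return SECOND_FORMAT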
In practical application, the current simultaneous interpretation process may have just started, or the historical word segmentation results may be too few, making it unsuitable to determine keywords from word frequencies in the recognized text. In that case, a keyword library may be preset at the server or in the cloud; the recognized text is matched against the keyword library, the matched words are determined as keywords, and the target segment is then determined using those keywords. The keyword library can be provided in advance by the speaker of the current simultaneous interpretation process. Alternatively, term lexicons for various technical fields may be pre-stored at the server or in the cloud: after obtaining the voice data to be processed, the server first determines the technical field of the voice data (the field of the current conference may be set in the server in advance by a host, or determined using a preset classification model), then obtains the term lexicon for that field from local storage or the cloud and uses it as the keyword library.
Based on this, in an embodiment, determining keywords based on the frequency of occurrence of the obtained at least one word in the recognized text may include:
determining keywords using the frequency of occurrence of the obtained at least one word in the recognized text when the current simultaneous interpretation process satisfies a second preset condition;
or,
when the current simultaneous interpretation process does not satisfy the second preset condition, determining keywords based on the frequency of occurrence of the obtained at least one word in the recognized text in combination with a third information base, where the third information base stores technical terms associated with the technical field corresponding to the voice data to be processed.
Specifically, when the current simultaneous interpretation process satisfies the second preset condition, keywords are determined using only the frequency of occurrence of the obtained at least one word in the recognized text; when it does not, keywords are determined based on that frequency in combination with the third information base. That is, when the current simultaneous interpretation process does not satisfy the second preset condition, the first preset condition becomes matching the information stored in the third information base.
In practical application, the second preset condition may be set as needed. For example, it may be whether the running duration of the current simultaneous interpretation process is greater than or equal to a fourth preset threshold (for example, 30 minutes): if so, the process is regarded as satisfying the second preset condition; if not, it is regarded as not satisfying it. For another example, it may be whether the total number of words in the first word bank is greater than or equal to a fifth preset threshold (for example, 1000): if so, the process is regarded as satisfying the second preset condition; if not, it is regarded as not satisfying it.
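The two example forms of the second preset condition can be sketched as alternative predicates; the thresholds are the example values from the text, and how an implementation chooses between the two forms is left open.

FOURTH_PRESET_THRESHOLD_S = 30 * 60  # example: 30 minutes of running time
FIFTH_PRESET_THRESHOLD = 1000        # example: total words in the first word bank

def satisfied_by_runtime(runtime_seconds):
    return runtime_seconds >= FOURTH_PRESET_THRESHOLD_S

def satisfied_by_word_bank(total_words):
    return total_words >= FIFTH_PRESET_THRESHOLD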
In practical application, in a given simultaneous interpretation scenario where the electronic device is a server, the server faces multiple terminals and sends simultaneous interpretation data to each of them. To ensure timeliness when sending interpretation data to many terminals at once, the server can use a cache and fetch the corresponding data directly from the cache when it receives a request for simultaneous interpretation data; this guarantees highly timely delivery of the interpretation data and conserves the server's computing resources.
Based on this, in an embodiment, the electronic device is a server, and the simultaneous interpretation data obtained from the voice data to be processed corresponds to at least one language; the method may further include:
caching the simultaneous interpretation data corresponding to the at least one language, classified by language.
Here, the simultaneous interpretation data for each language includes the recognized text, the target segment, the first presentation format, and the second presentation format.
In practical application, the server may determine in advance the preset language of each of at least one terminal and obtain the simultaneous interpretation data corresponding to that preset language from the database for caching.
Through this caching operation, when a terminal selects a language other than its preset language, the simultaneous interpretation data for that language can be obtained directly from the cache, which improves timeliness and conserves computing resources.
In practical application, when a terminal selects a language other than the preset language, the simultaneous interpretation data for that language may not yet be cached; when the server receives such an acquisition request, it can cache the simultaneous interpretation data for the requested language. When other terminals later select the same language, the corresponding data can be obtained directly from the cache, which likewise improves timeliness and conserves computing resources.
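A minimal sketch of the per-language cache follows, assuming a simple in-process dictionary; a production server might use a dedicated cache service instead, which the patent does not prescribe.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class InterpretationData:
    # Per-language simultaneous interpretation data, as listed above.
    recognized_text: str
    target_segments: List[str]
    first_format: dict
    second_format: dict

_cache: Dict[str, InterpretationData] = {}  # keyed by language code

def cache_put(language, data):
    _cache[language] = data

def cache_get(language):
    # Return cached data for the language, or None if not cached yet.
    return _cache.get(language)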
In practical application, in order to provide simultaneous interpretation data in the language each participant requires, the simultaneous interpretation data corresponding to a target language can be obtained according to an acquisition request sent by the participant through a terminal.
Based on this, in an embodiment, as shown in fig. 4, the method may further include the steps of:
step 401: receiving an acquisition request sent by a terminal; the acquisition request is used for acquiring simultaneous interpretation data; the acquisition request at least comprises: target language;
step 402: obtaining simultaneous interpretation data corresponding to the target language from the cached simultaneous interpretation data;
step 403: and sending the obtained simultaneous interpretation data corresponding to the target language to a terminal.
Here, the terminal may be a terminal held by a user in the system architecture shown in fig. 1. The terminal may provide a human-computer interaction interface through which a participant selects a language; the terminal then generates an acquisition request containing the target language according to the participant's selection and sends it to the server, so that the server receives the acquisition request.
The data processing method provided by the embodiment of the invention acquires voice data to be processed and performs text recognition on the voice data to obtain a recognized text, where the recognized text is presented while the voice data is played; performs word segmentation on the recognized text to obtain at least one word; determines keywords based on the frequency of occurrence of the obtained at least one word in the recognized text and determines a target segment in the recognized text using the determined keywords; and determines a first presentation format for the target segment, so that the target segment is presented in that format when the recognized text is presented, the first presentation format being different from the second presentation format, which is the presentation format of the characters in the recognized text other than the target segment. In this way, key information can be extracted from the speech and presented with emphasis when the interpretation data is displayed to the conference participants, so that the participants can grasp the key information of the speech and better understand the speaker's content.
The present invention will be described in further detail with reference to the following application examples.
The data processing apparatus provided in this application embodiment is applied to a simultaneous interpretation scenario and includes a speech recognition module and a machine translation module; wherein:
the speech recognition module is used for converting the speaker's speech from sound waves into text;
and the machine translation module is used for translating the text output by the speech recognition module into text in the language required by the participants, obtaining a translation result.
The data processing method provided in this application embodiment is applied to the data processing apparatus, and as shown in fig. 5, the data processing method specifically includes the following steps:
step 501: collect the speaker's speech to be processed (namely the voice data to be processed), and use the speech recognition module to convert the speech from sound waves into text, obtaining a recognition text; then perform step 502.
Here, text obtained by directly recognizing speech may suffer from missing punctuation marks, incorrect punctuation marks, disfluent sentences, and the like; the speech recognition module therefore needs to perform normalization processing on the recognized text, such as adding, deleting, or modifying punctuation marks and adjusting word order, determine the processed text as the recognition text, and output the recognition text to the machine translation module.
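As an illustration of this normalization step, the sketch below applies a few rule-based repairs to raw recognition output. A production system would typically use a trained punctuation-restoration model; the rules here are assumptions chosen only to show the interface.

```python
import re

def normalize_recognized_text(raw: str) -> str:
    """Minimal normalization sketch: collapse spacing artifacts,
    capitalize the sentence start, and add missing end punctuation."""
    text = re.sub(r"\s+", " ", raw).strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    if text[-1] not in ".!?":
        text += "."
    return text

print(normalize_recognized_text("today   we discuss neural machine translation"))
# -> "Today we discuss neural machine translation."
```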
Step 502: translate the recognition text output by the speech recognition module by using the machine translation module to obtain a translation result; then perform step 503.
Here, the recognized text output by the speech recognition module and the translation result output by the machine translation module correspond to the recognized text obtained in step 201 of the data processing method shown in fig. 2; the specific implementation processes of step 501 and step 502 are the same as the specific implementation process of step 201 of the data processing method shown in fig. 2; and will not be described in detail herein.
Step 503: capture the key segments in the recognition text and the key segments in the translation result; then perform step 504.
Here, the key segments may be words, sentences, or paragraphs.
In practical application, the key segments in the recognition text and the key segments in the translation result can be captured in the following two ways:
the first method is as follows: presetting a professional term dictionary or a user dictionary in the data processing device; the user dictionary is a key word dictionary provided by a speaker in the current simultaneous interpretation process before a conference; the professional term dictionary is a professional term dictionary corresponding to each technical field; the professional term dictionary and the user dictionary each correspond to at least two languages (the same language as the speaking voice of the speaker and any language different from the speaking voice of the speaker). And respectively matching the recognition text and the translation result by using the professional term dictionary or the user dictionary, and respectively determining the text which is the same as the professional term dictionary or the user dictionary in the recognition text and the translation result as a highlight section of the recognition text to be captured and a highlight section of the translation result to be captured.
The second method: use a word segmentation tool (namely the preset word segmentation model) to segment the recognition text and the translation result respectively. For the recognition text, filter the words obtained after segmentation to remove words that are common in daily use and carry little information, such as stop words and frequently used words; count the frequency of each remaining word in the recognition text; and determine the words with higher frequency (namely, frequency greater than or equal to a second preset threshold) as key segments in the recognition text. The translation result is processed in the same way: filter the segmented words, count the occurrence frequency of each remaining word in the translation result, and determine the words with higher frequency (namely, greater than or equal to the second preset threshold) as key segments in the translation result.
Of the two methods, the first is relatively simple and direct and can quickly capture the key segments in the recognition text and the translation result; the second is more flexible and comprehensive and can effectively capture the key segments without depending on external resources (preset dictionaries). In practical application, the two methods can be combined as required, as in the sketch below.
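A minimal sketch of both capture methods follows; whitespace tokenization stands in for the preset word segmentation model, and the stop-word list and threshold value are illustrative assumptions.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in"}  # illustrative

def capture_by_dictionary(tokens, term_dictionary):
    # Method one: tokens matching a term/user dictionary entry are key segments.
    return [t for t in tokens if t in term_dictionary]

def capture_by_frequency(tokens, threshold=2):
    # Method two: drop low-information words, count the rest, and keep words
    # whose frequency reaches the (assumed) second preset threshold.
    counts = Counter(t for t in tokens if t.lower() not in STOP_WORDS)
    return [word for word, count in counts.items() if count >= threshold]

tokens = "the encoder maps the input to a vector and the decoder reads the vector".split()
print(capture_by_dictionary(tokens, {"encoder", "decoder"}))  # ['encoder', 'decoder']
print(capture_by_frequency(tokens))                           # ['vector']
```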
Specifically, the specific implementation process of step 503 is the same as the specific implementation process of step 202 of the data processing method shown in fig. 2; and will not be described in detail herein.
Step 504: display the recognition text and the translation result on the screen of the terminal device, and display the key segments captured in step 503 in a bold and enlarged format.
Here, the data processing apparatus transmits the recognition text, the translation result, and their presentation formats to the terminal device, and the terminal device presents the recognition text and the translation result on its screen in the indicated formats.
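One possible shape for what the apparatus sends to the terminal is sketched below; the field names and style values are assumptions for illustration, not a format defined by this embodiment.

```python
import json

def build_display_payload(recognition_text, translation_result, key_segments):
    # Texts plus per-segment presentation formats: the terminal renders the
    # listed segments in the first presentation format (bold, enlarged) and
    # all remaining characters in the second presentation format.
    return json.dumps({
        "recognition_text": recognition_text,
        "translation_result": translation_result,
        "key_segment_format": {"bold": True, "font_scale": 1.5},
        "key_segments": key_segments,
    }, ensure_ascii=False)

print(build_display_payload("the encoder maps ...", "...", ["encoder"]))
```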
Specifically, the specific implementation process of step 504 is the same as the specific implementation process of step 203 of the data processing method shown in fig. 2; and will not be described in detail herein.
The data processing device and the data processing method provided by the application embodiment have the following advantages:
the method and the apparatus can extract key information from the speaker's speech so that the key information is highlighted when the simultaneous interpretation data is displayed to the participants of the simultaneous interpretation conference, thereby helping the participants grasp the key information of the speech through the subtitles and better understand the speech content.
In order to implement the method of the embodiment of the present invention, the embodiment of the present invention further provides a data processing apparatus; as shown in fig. 6, the data processing apparatus 600 includes: an acquisition unit 601, a first processing unit 602, and a second processing unit 603; wherein:
the acquiring unit 601 is configured to acquire voice data to be processed, perform text recognition on the voice data, and acquire a recognition text; the recognition text is used for presenting when the voice data is played;
the first processing unit 602 is configured to perform word segmentation on the recognition text to obtain at least one word; determine a keyword based on the occurrence frequency of the obtained at least one word in the recognition text, and determine a target segment in the recognition text by using the determined keyword;
the second processing unit 603 is configured to determine a first presentation format of the target segment, so as to present the target segment in the first presentation format when the recognized text is presented; the first presentation format is different from the second presentation format; the second presentation format is the presentation format of other characters except the target segment in the recognition text.
In an embodiment, the first processing unit 602 is specifically configured to:
filtering the obtained at least one word to obtain a filtered word segmentation result;
and determining keywords based on the occurrence frequency, in the recognition text, of each word in the filtered word segmentation result.
In an embodiment, the first processing unit 602 is specifically configured to:
for each word, determining the frequency of occurrence of the respective word in the recognition text; when the occurrence frequency of the respective word in the recognition text meets a first preset condition, determining the respective word as a keyword; wherein:
the determining the frequency of occurrence of the corresponding word in the recognized text comprises one of:
determining the occurrence frequency of the corresponding word in the word segmentation result corresponding to the recognition text;
determining the occurrence frequency of the corresponding word in the first information base; the first information base stores historical word segmentation results of the current simultaneous interpretation process;
determining the occurrence frequency of the corresponding word in the second information base; the second information base comprises the first information base and word segmentation results corresponding to the recognition texts;
the determining the target segment by using the determined keyword comprises one of the following steps:
determining the keyword as the target segment;
dividing the recognition text into at least one text segment; and determining the text segment containing the keywords as the target segment.
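The three frequency sources and the two ways of forming a target segment can be sketched as follows; a sentence-level split is one assumed way of dividing the recognition text into text segments.

```python
def occurrence_frequency(word, current_tokens, history_tokens=None, combine=False):
    """Frequency from the current segmentation result, from the first
    information base (history of the current interpretation process), or
    from the second information base (history plus current result)."""
    if history_tokens is None:
        return current_tokens.count(word)      # current segmentation result
    if combine:
        return history_tokens.count(word) + current_tokens.count(word)
    return history_tokens.count(word)          # first information base only

def target_segments(recognition_text, keywords):
    # Split into sentence-level text segments; any segment containing a
    # keyword is a target segment (using the keyword itself is the other option).
    segments = [s.strip() for s in recognition_text.split(".") if s.strip()]
    return [s for s in segments if any(k in s for k in keywords)]

print(target_segments("The encoder maps input. We break for lunch.", ["encoder"]))
```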
In an embodiment, when determining the keyword, the first processing unit 602 is further configured to:
weighting the occurrence frequency of the corresponding word in the recognition text and the probability of the corresponding word in the technical field corresponding to the voice data to be processed to obtain a weighting result; and when the weighting result meets a first preset condition, determining the corresponding word as a keyword.
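A sketch of this weighting follows; the weight values and threshold are illustrative assumptions, as the embodiment does not fix the first preset condition numerically.

```python
def is_keyword(frequency, domain_probability, w_freq=0.6, w_domain=0.4, threshold=1.0):
    # Weight the word's occurrence frequency in the recognition text against
    # its probability in the technical field of the voice data; the word is a
    # keyword when the weighted result meets the (assumed) preset condition.
    weighted = w_freq * frequency + w_domain * domain_probability
    return weighted >= threshold
```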
In an embodiment, the first processing unit 602 is further specifically configured to:
and segmenting the recognized text by using a preset segmentation model.
In an embodiment, the first processing unit 602 is further configured to:
determining a keyword by using the occurrence frequency of the obtained at least one word in the recognition text under the condition that the current simultaneous interpretation process meets a second preset condition;
or,
under the condition that the current simultaneous interpretation process does not meet the second preset condition, determining a keyword based on the occurrence frequency of the obtained at least one word in the recognition text in combination with a third information base; the third information base stores technical terms associated with the technical field corresponding to the voice data to be processed.
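Under one reading of this branch, a sketch might look like the following; treating "combining the third information base" as letting a stored technical term qualify even at low frequency is an assumption, not the only possible combination.

```python
def select_keywords(word_counts, term_base, process_meets_condition, threshold=2):
    # When the current interpretation process meets the second preset
    # condition, frequency alone decides; otherwise frequency is combined
    # with the technical-term base for the relevant field.
    if process_meets_condition:
        return [w for w, c in word_counts.items() if c >= threshold]
    return [w for w, c in word_counts.items() if c >= threshold or w in term_base]
```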
In an embodiment, when determining the first presentation format of the target segment, the second processing unit 603 is further configured to:
for a first keyword among the determined keywords, determining the presentation format corresponding to the occurrence frequency of the first keyword in the recognition text by using the correspondence between word occurrence frequency and presentation format;
and taking the determined presentation format as the presentation format of the text corresponding to the first keyword in the target segment.
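A sketch of one possible correspondence between occurrence frequency and presentation format follows; the frequency tiers and style values are assumptions for illustration.

```python
def presentation_format_for(frequency):
    # More frequent keywords receive a more prominent first presentation
    # format; all other characters keep the second presentation format.
    if frequency >= 10:
        return {"bold": True, "font_scale": 2.0}
    if frequency >= 5:
        return {"bold": True, "font_scale": 1.5}
    return {"bold": True, "font_scale": 1.2}
```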
The functions of the acquiring unit 601, the first processing unit 602, and the second processing unit 603 are equivalent to the functions of the speech recognition module and the machine translation module of the data processing apparatus in the application embodiment.
In practical applications, the obtaining unit 601, the first processing unit 602, and the second processing unit 603 may be implemented by a processor in the data processing apparatus 600 in combination with a communication interface; the processor may be a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Micro Control Unit (MCU), or a Field-Programmable Gate Array (FPGA).
It should be noted that: the data processing apparatus 600 provided in the above embodiment is illustrated only by the above division of program modules for simultaneous interpretation; in practical applications, the above processing may be distributed among different program modules as needed, that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the apparatus provided in the above embodiment and the data processing method embodiments belong to the same concept; the specific implementation process is described in detail in the method embodiments and is not repeated here.
Based on the hardware implementation of the above-mentioned device, an electronic device is further provided in the embodiment of the present invention, fig. 7 is a schematic diagram of a hardware composition structure of the electronic device in the embodiment of the present invention, as shown in fig. 7, an electronic device 70 includes a memory 73, a processor 72, and a computer program stored on the memory 73 and capable of running on the processor 72, where when the processor 72 executes the program, the method provided by one or more of the above-mentioned technical solutions is implemented.
In particular, the processor 72 located at the electronic device 70, when executing the program, implements: acquiring voice data to be processed, and performing text recognition on the voice data to acquire a recognition text; the recognition text is used for presenting when the voice data is played; performing word segmentation on the recognition text to obtain at least one word; determining a keyword based on the occurrence frequency of the obtained at least one word in the identification text, and determining a target segment in the identification text by using the determined keyword; determining a first presentation format of the target segment to present the target segment in the first presentation format when presenting the recognized text; the first presentation format is different from the second presentation format; the second presentation format is the presentation format of other characters except the target segment in the recognition text.
It should be noted that, the specific steps implemented when the processor 72 located in the electronic device 70 executes the program have been described in detail above, and are not described herein again.
It is understood that the electronic device 70 further comprises a communication interface 71, and the communication interface 71 is used for information interaction with other devices; meanwhile, various components in the electronic device 70 are coupled together by a bus system 74. It will be appreciated that the bus system 74 is configured to enable connected communication between these components. The bus system 74 includes a power bus, a control bus, a status signal bus, and the like, in addition to the data bus.
It will be appreciated that the memory 73 in this embodiment can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk memory or tape memory. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory described in the embodiments of the present invention is intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed in the above embodiments of the present invention may be applied to the processor 72, or may be implemented by the processor 72. The processor 72 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by hardware integrated logic circuits or software instructions in the processor 72. The processor 72 described above may be a general purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 72 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium in the memory 73; the processor 72 reads the information in the memory and completes the steps of the above method in combination with its hardware.
The embodiment of the invention also provides a storage medium, in particular a computer storage medium, and more particularly a computer readable storage medium. Stored thereon are computer instructions, i.e. computer programs, which when executed by a processor perform the methods provided by one or more of the above-mentioned aspects.
In the embodiments provided in the present invention, it should be understood that the disclosed method and device may be implemented in other ways. The above-described device embodiments are merely illustrative; for example, the division of the units is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
It should be noted that: "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
In addition, the technical solutions described in the embodiments of the present invention may be arbitrarily combined without conflict.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.

Claims (12)

1. A data processing method, comprising:
acquiring voice data to be processed, and performing text recognition on the voice data to acquire a recognition text; the recognition text is used for presenting when the voice data is played;
performing word segmentation on the recognition text to obtain at least one word; determining a keyword based on the occurrence frequency of the obtained at least one word in the recognition text, and determining a target segment in the recognition text by using the determined keyword;
determining a first presentation format of the target segment to present the target segment in the first presentation format when presenting the recognized text; the first presentation format is different from the second presentation format; the second presentation format is the presentation format of other characters except the target segment in the recognition text.
2. The method of claim 1, wherein determining a keyword based on the occurrence frequency of the obtained at least one word in the recognition text comprises:
filtering the obtained at least one word to obtain a filtered word segmentation result;
and determining keywords based on the occurrence frequency, in the recognition text, of each word in the filtered word segmentation result.
3. The method of claim 1, wherein determining a keyword based on the occurrence frequency of the obtained at least one word in the recognition text comprises:
for each word, determining the frequency of occurrence of the respective word in the recognized text; and when the occurrence frequency of the corresponding word in the recognition text meets a first preset condition, determining the corresponding word as a keyword.
4. The method of claim 3, wherein the determining the frequency of occurrence of the respective word in the recognized text comprises one of:
determining the occurrence frequency of the corresponding word in the word segmentation result corresponding to the recognition text;
determining the occurrence frequency of the corresponding word in the first information base; the first information base stores historical word segmentation results of the current simultaneous interpretation process;
determining the occurrence frequency of the corresponding word in the second information base; the second information base comprises the first information base and word segmentation results corresponding to the recognition texts.
5. The method of claim 3 or 4, wherein in determining keywords, the method further comprises:
weighting the occurrence frequency of the corresponding word in the recognition text and the probability of the corresponding word in the technical field corresponding to the voice data to be processed to obtain a weighting result; and when the weighting result meets a first preset condition, determining the corresponding word as a keyword.
6. The method of claim 1, wherein determining the target segment in the recognition text using the determined keyword comprises one of:
determining the keyword as the target segment;
dividing the recognition text into at least one text segment; and determining the text segment containing the keywords as the target segment.
7. The method of claim 1, wherein the tokenizing the recognized text comprises:
and segmenting the recognized text by using a preset segmentation model.
8. The method of claim 1, wherein determining a keyword based on the occurrence frequency of the obtained at least one word in the recognition text comprises:
determining a keyword by using the occurrence frequency of the obtained at least one word in the recognition text under the condition that the current simultaneous interpretation process meets a second preset condition;
or,
under the condition that the current simultaneous interpretation process does not meet the second preset condition, determining a keyword based on the occurrence frequency of the obtained at least one word in the recognition text in combination with a third information base; the third information base stores technical terms associated with the technical field corresponding to the voice data to be processed.
9. The method of claim 1, wherein determining the first presentation format for the target segment further comprises:
for a first keyword among the determined keywords, determining the presentation format corresponding to the occurrence frequency of the first keyword in the recognition text by using the correspondence between word occurrence frequency and presentation format;
and taking the determined presentation format as the presentation format of the text corresponding to the first keyword in the target segment.
10. A data processing apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring voice data to be processed and performing text recognition on the voice data to acquire a recognition text; the recognition text is used for presenting when the voice data is played;
the first processing unit is used for segmenting the recognition text to obtain at least one word; determining a keyword based on the occurrence frequency of the obtained at least one word in the recognition text, and determining a target segment in the recognition text by using the determined keyword;
a second processing unit, configured to determine a first presentation format of the target segment, so as to present the target segment in the first presentation format when presenting the recognized text; the first presentation format is different from the second presentation format; the second presentation format is the presentation format of other characters except the target segment in the recognition text.
11. An electronic device, comprising: a processor and a memory for storing a computer program capable of running on the processor;
wherein the processor is adapted to perform the steps of the method of any one of claims 1 to 9 when running the computer program.
12. A storage medium storing a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 9 when executed by a processor.