WO2021134284A1 - Voice information processing method, hub device, control terminal and storage medium - Google Patents

Voice information processing method, hub device, control terminal and storage medium

Info

Publication number
WO2021134284A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
interview
record
speaker
voice data
Prior art date
Application number
PCT/CN2019/130075
Other languages
French (fr)
Chinese (zh)
Inventor
郝杰 (Hao Jie)
Original Assignee
Shenzhen Huantai Technology Co., Ltd. (深圳市欢太科技有限公司)
Oppo Guangdong Mobile Telecommunications Co., Ltd. (Oppo广东移动通信有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huantai Technology Co., Ltd. (深圳市欢太科技有限公司) and Oppo Guangdong Mobile Telecommunications Co., Ltd. (Oppo广东移动通信有限公司)
Priority to CN201980101053.3A priority Critical patent/CN114503117A/en
Priority to PCT/CN2019/130075 priority patent/WO2021134284A1/en
Publication of WO2021134284A1 publication Critical patent/WO2021134284A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Definitions

  • the embodiments of the present application relate to the field of voice processing technology, and in particular to a voice information processing method, a hub device, a control terminal, and a storage medium.
  • the embodiments of the present application aim to provide a voice information processing method, a hub device, a control terminal, and a storage medium, which can generate interview records and improve the generation speed and processing efficiency of interview records for voice interviews.
  • the embodiment of the present application provides a voice information processing method, including:
  • after a voice interview starts, the speaker's voice data to be simultaneously interpreted transmitted by the collection terminal is received, and the collection time at which the voice data started to be collected is obtained;
  • the identity information of the speaker is determined based on the voice data to be simultaneously interpreted and a preset mapping relationship, and the voice data to be simultaneously interpreted is translated into the listener's target language in real time to obtain translation information;
  • the preset mapping relationship is the correspondence between the participants' identity information, target languages, and voiceprint information; the listener is any participant other than the speaker;
  • the collection time, the speaker identity information, and the translation information corresponding to the voice data to be simultaneously interpreted are recorded to obtain one piece of record fragment information, so that at least one piece of record fragment information is obtained at the end of the voice interview;
  • an interview record is generated based on the at least one piece of record fragment information.
  • in some embodiments, determining the identity information of the speaker based on the voice data to be simultaneously interpreted and the preset mapping relationship, and translating the voice data to be simultaneously interpreted into the listener's target language in real time to obtain the translation information, includes:
  • determining, from the participants' voiceprint information in the preset mapping relationship, the target voiceprint information matching the voiceprint of the voice data to be simultaneously interpreted, and thereby determining the speaker identity information corresponding to that voiceprint;
  • determining the identity information of the listener, and obtaining the listener's target language based on the correspondence between the participants' identity information and the target languages in the preset mapping relationship.
  • in some embodiments, the recording of the collection time, the speaker identity information, and the translation information corresponding to the voice data to be simultaneously interpreted to obtain one piece of record fragment information, and then the obtaining of at least one piece of record fragment information at the end of the voice interview, includes:
  • in some embodiments, after the collection time, the speaker identity information, and the translation information corresponding to the voice data to be simultaneously interpreted are recorded to obtain one piece of record fragment information, and at least one piece of record fragment information is obtained at the end of the voice interview, the method further includes:
  • summary extraction is performed on the at least one piece of record fragment information to extract full-text summary information;
  • the generating of an interview record based on the at least one piece of record fragment information includes:
  • generating the interview record based on the at least one piece of record fragment information and the full-text summary information.
  • in some embodiments, the method further includes:
  • obtaining the at least one piece of record fragment information and at least one piece of speaker summary information.
  • the generating of an interview record based on the at least one piece of record fragment information includes:
  • generating the interview record based on the at least one piece of record fragment information and the at least one piece of speaker summary information.
  • the generating an interview record based on the information of the at least one record segment includes:
  • the method further includes:
  • the embodiment of the present application also provides a voice information processing method, including:
  • at the end of the interview, an interview trigger instruction is received, and an interview generation instruction is generated in response to the interview trigger instruction; the interview generation instruction is sent to the hub device;
  • the interview record fed back by the hub device is received, the interview record being generated by the hub device in response to the interview generation instruction, based on the preset mapping relationship and the voice data to be simultaneously interpreted received in real time.
  • the method further includes:
  • the interview record is displayed in time-axis order.
  • each interview record segment in the interview record includes: speaker identity information, collection time, the voice data to be simultaneously interpreted, and translation information.
  • the display of the interview records in the order of time axis includes:
  • in some embodiments, each interview record segment in the interview record further includes speaker summary information; after the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged interview record segment are displayed, the method further includes:
  • the speaker summary information is displayed.
  • in some embodiments, the interview record further includes full-text summary information; after the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged interview record segment are displayed, the method further includes:
  • the full-text summary information is displayed before the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged interview record segment.
  • the method further includes:
  • the interview record is edited, and the final interview record is obtained and displayed.
  • the method further includes:
  • the embodiment of the application provides a hub device, including:
  • the first receiving unit is configured to receive the speaker's voice data to be simultaneously interpreted transmitted by the collection terminal after the voice interview starts, and to obtain the collection time at which the voice data started to be collected;
  • the determining unit is configured to determine the identity information of the speaker based on the voice data to be simultaneously interpreted and the preset mapping relationship;
  • the translation unit is configured to translate the voice data to be simultaneously interpreted into the listener's target language in real time to obtain translation information;
  • the preset mapping relationship is the correspondence between the participants' identity information, target languages, and voiceprint information; the listener is any participant other than the speaker;
  • the recording unit is configured to record the collection time, the speaker identity information, and the translation information corresponding to the voice data to be simultaneously interpreted to obtain one piece of record fragment information, so that at least one piece of record fragment information is obtained at the end of the voice interview;
  • the first generating unit is configured to generate an interview record based on the at least one piece of record fragment information.
  • an embodiment of the present application also provides a control terminal, including:
  • the second receiving unit is configured to receive the participants' identity information, target languages, and voiceprint information;
  • the mapping unit is configured to send the preset mapping relationship formed from the participants' identity information, target languages, and voiceprint information to the hub device;
  • the second receiving unit is further configured to receive an interview trigger instruction at the end of the interview;
  • the second generation unit is configured to generate an interview generation instruction in response to the interview trigger instruction;
  • the second sending unit is configured to send the interview generation instruction to the hub device;
  • the second receiving unit is further configured to receive the interview record fed back by the hub device in response to the interview generation instruction, the interview record being generated by the hub device, in response to the interview generation instruction, based on the preset mapping relationship and the voice data to be simultaneously interpreted received in real time.
  • the embodiment of the present application also provides a hub device, including:
  • a first processor and a first memory;
  • the first processor is configured to execute the simultaneous interpretation program stored in the first memory to implement the voice information processing method on the central device side.
  • an embodiment of the present application also provides a control terminal, including: a second processor and a second memory;
  • the second processor is configured to execute the simultaneous interpretation program stored in the second memory to implement the voice information processing method on the control terminal side.
  • an embodiment of the present application provides a storage medium on which a simultaneous interpretation program is stored; when the simultaneous interpretation program is executed by a first processor, the voice information processing method on the hub device side is implemented; or, when the simultaneous interpretation program is executed by a second processor, the voice information processing method on the control terminal side is implemented.
  • the embodiments of the present application provide a voice information processing method, a hub device, a control terminal, and a storage medium, the method including: after a voice interview starts, receiving the speaker's voice data to be simultaneously interpreted transmitted by the collection terminal, and obtaining the collection time at which the voice data started to be collected; determining the identity information of the speaker based on the voice data to be simultaneously interpreted and a preset mapping relationship, and translating the voice data into the listener's target language in real time to obtain translation information, where the preset mapping relationship is the correspondence between the participants' identity information, target languages, and voiceprint information, and the listener is any participant other than the speaker; recording the collection time, the speaker identity information, and the translation information corresponding to the voice data to obtain one piece of record fragment information, so that at least one piece of record fragment information is obtained at the end of the voice interview; and generating an interview record based on the at least one piece of record fragment information.
  • in this way, the hub device can determine the speaker's identity information from the speaker's voice data in a voice interview scene and obtain translation information in the language each listener requires, and at the end of the interview generate the interview record from this information; thus, while performing real-time simultaneous interpretation of the voice data, the hub device also records the identified speaker identity information, translation information, and other structured data as record fragment information.
  • at the end of the interview, the accumulated record fragment information is used to generate the interview record of the voice interview; this improves the efficiency of data collation in voice interviews, that is, the generation speed and processing efficiency of interview records.
  • FIG. 1 is an architecture diagram of a voice information processing system provided by an embodiment of the application
  • FIG. 2 is a first schematic flowchart of a voice information processing method provided by an embodiment of this application
  • FIG. 3 is a second schematic flowchart of a voice information processing method provided by an embodiment of this application.
  • FIG. 4 is a third schematic flowchart of a voice information processing method provided by an embodiment of this application.
  • FIG. 5 is a first schematic flowchart of a voice information processing method provided by an embodiment of this application.
  • FIG. 6 is a second schematic flowchart of a voice information processing method provided by an embodiment of this application.
  • FIG. 7 is a schematic diagram 1 of an exemplary display interface of interview records provided by an embodiment of the application.
  • Fig. 8 is a second schematic diagram of an exemplary display interface for interview records provided by an embodiment of the application.
  • FIG. 9 is a third schematic diagram of an exemplary display interface for interview records provided by an embodiment of the application.
  • FIG. 10 is a fourth schematic diagram of a display interface of an exemplary interview record provided by an embodiment of the application.
  • FIG. 11 is an interaction diagram of a voice information processing method provided by an embodiment of this application.
  • FIG. 12 is a schematic diagram 1 of the composition structure of a hub device provided by an embodiment of the application.
  • FIG. 13 is a second schematic diagram of the composition structure of the hub device provided by an embodiment of the application.
  • FIG. 14 is a schematic diagram 1 of the composition structure of a control terminal provided by an embodiment of the application.
  • FIG. 15 is a second schematic diagram of the composition structure of a control terminal provided by an embodiment of the application.
  • the embodiment of the application provides a voice information processing method, which is implemented by a voice information processing device.
  • the voice information processing device provided in the embodiment of the application may include a central device, a control terminal, and a transceiver integrated terminal (including a collection terminal and a receiving terminal).
  • FIG. 1 is a schematic structural diagram of a voice information processing system to which the voice information processing method is applied; as shown in FIG. 1, the voice information processing system may include: a hub device 1, a control terminal 2, and multiple transceiver integrated terminals 3 (including a collection terminal 3-1 and a receiving terminal 3-2).
  • the speaker can use the integrated transceiver terminal 3 (ie collection terminal 3-1) that he wears to give conference lectures.
  • the collection terminal 3-1 collects the speaker's voice data (that is, the voice data to be simultaneously interpreted) and transmits it to the hub device 1 in real time.
  • after the hub device 1 obtains the voice data to be simultaneously interpreted, it acquires the time at which it started receiving that voice data as the collection time; it then determines, based on the voice data and the preset mapping relationship, the identity information of the speaker who is currently speaking, and at the same time translates the voice data into each listener's target language in real time to obtain the translation information (specifically, the translation result in the language each listener requires); the translation information is transmitted in real time to the integrated terminals of the listeners attending the meeting (i.e., the receiving terminals 3-2), and the voice data to be simultaneously interpreted, collection time, speaker identity information, and translation information corresponding to each speaker are recorded as the record fragment information corresponding to that speaker.
  • the hub device 1 can receive the interview generation instruction sent by the control terminal 2 and, according to that instruction, generate the interview record from all the record fragment information obtained in the meeting, and finally send the interview record to the control terminal 2.
  • the control terminal 2 can display the interview record, or share the interview record with the user terminal owned by the participant participating in the meeting through the control terminal 2, so that the participant can access or browse the meeting record.
  • FIG. 2 is a first schematic flowchart of a voice information processing method provided by an embodiment of the application. As shown in Figure 2, when applied to a hub device, the voice information processing method includes the following steps:
  • the voice information processing method provided by the embodiment of this application can be applied to international conferences, international interviews, or various scenarios that require simultaneous interpretation and translation, and the embodiments of this application are not limited.
  • application scenarios can also be divided into large-scale international conferences, small-scale work conferences, public service venues, public social venues, social applications, and general scenarios.
  • public service places may be waiting halls, government office halls, etc.
  • public social places may be coffee shops, concert halls, etc.
  • the actual application scenario corresponding to the voice data to be simultaneously interpreted is the specific application scenario in which that voice data is collected.
  • the specific actual application scenario is not limited in the embodiment of this application.
  • the hub device communicates with the integrated terminal and the control terminal.
  • the integrated terminal is a transceiver integrated terminal worn by participants participating in a voice interview, such as a headset/microphone integrated terminal.
  • the integrated terminal worn by each speaker can also be called a collection terminal, and the integrated terminals worn by the other listeners at that time can be called receiving terminals.
  • the communication mode may be wireless communication technology, wired communication technology, near field communication technology, etc., for example, Bluetooth or Wi-Fi, etc., which is not limited in the embodiment of the present application.
  • the integrated terminal has both a headset and a microphone.
  • the microphone is used to collect voice data for simultaneous interpretation when speaking, and the headset is used to play translation information when listening. Therefore, each participant can be either a speaker or a listener, and the specific decision is based on actual conditions, and the embodiment of the present application does not limit it.
  • the speaker uses the collection terminal to collect the voice data to be simultaneously interpreted and transmits it to the hub device in real time.
  • the hub device also obtains the time at which it starts receiving the voice data to be simultaneously interpreted, that is, the collection time.
  • each speaker's voice data to be simultaneously interpreted is transmitted in real time, but the hub device records only the time at which each speaker starts to speak as the collection time.
  • the voice data to be simultaneously translated may be any voice that requires voice translation, for example, voice collected in real time in an application scenario.
  • the voice data to be interpreted can be voices in any type of language.
  • the specific voice data to be simultaneously transmitted is not limited in this embodiment of the application.
  • the speaker in this application is whichever participant is speaking at a given time, which is not limited in the embodiments of this application.
  • S102: Determine the identity information of the speaker based on the voice data to be simultaneously interpreted and the preset mapping relationship, and translate the voice data into the listener's target language in real time to obtain the translation information; the preset mapping relationship is the correspondence between the participants' identity information, target languages, and voiceprint information; the listener is any participant other than the speaker.
  • the preset mapping relationship is stored in the hub device before the voice data to be simultaneously interpreted is obtained; the preset mapping relationship is the correspondence between the participants' identity information, target languages, and voiceprint information.
  • the hub device can first find the target voiceprint information matching the voice data to be simultaneously interpreted among the participants' voiceprint information stored in the preset mapping relationship; it then looks up the participant identity information corresponding to that target voiceprint information (that is, the speaker's identity information) from the correspondence between participants' identity information and voiceprint information, confirms the listeners' identity information, determines each listener's target language from the correspondence between participants' identity information and target languages, and finally translates the voice data into each listener's target language to obtain the translation information, so that every listener can hear the speaker's speech in a familiar language through the receiving terminal.
  • the listener is a person other than the speaker among the participants.
  • the speaker identity information may be the name or the unique identifier of the speaker, which is not limited in the embodiment of the present application.
  • the preset mapping relationship is the correspondence between the participants' identity information, target languages, and voiceprint information; it can be expressed as comprising the correspondence between identity information and target language, the correspondence between identity information and voiceprint information, the relationship between target language and voiceprint information, and databases of the participants' voiceprint information, identity information, and target languages.
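As a hedged illustration (not taken from the patent text), the preset mapping relationship described above could be represented as one entry per participant linking identity information, target language, and a registered voiceprint vector; all names, fields, and values below are hypothetical:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ParticipantMapping:
    # One row of the preset mapping relationship: identity information,
    # target language, and registered voiceprint information.
    identity: str
    target_language: str
    voiceprint: List[float]

# Hypothetical participants registered before the voice interview starts.
PRESET_MAPPING = [
    ParticipantMapping("Alice", "en", [0.12, 0.80, 0.05]),
    ParticipantMapping("Bob",   "zh", [0.90, 0.10, 0.33]),
    ParticipantMapping("Chloe", "fr", [0.40, 0.42, 0.77]),
]

def target_language_of(identity: str) -> str:
    """Resolve a participant's target language from the preset mapping."""
    for entry in PRESET_MAPPING:
        if entry.identity == identity:
            return entry.target_language
    raise KeyError(identity)
```

In practice the voiceprint would be a speaker embedding produced by a voiceprint model during registration, not a short hand-written vector.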
  • the hub device determines, from the participants' voiceprint information in the preset mapping relationship, the target voiceprint information matching the voiceprint of the voice data to be simultaneously interpreted; based on the target voiceprint information and the correspondence between participants' identity information and voiceprint information, it determines the speaker identity information corresponding to that voiceprint, and, based on the correspondence between participants' identity information and target languages, obtains each listener's target language; finally, the voice data is translated into the listener's target language in real time to obtain the translation information.
  • since the hub device has learned the correspondence between target languages and voiceprint information, and the participants other than the speaker are the listeners, once the target voiceprint information corresponding to the speaker is determined, the target languages corresponding to the remaining unmatched voiceprint information are the listeners' target languages; in this way, the target language corresponding to each listener can be determined.
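The matching step above can be sketched as a nearest-voiceprint lookup: the best-matching registered voiceprint identifies the speaker, and everyone else becomes a listener whose target language is collected for translation. This is a minimal sketch assuming cosine similarity over voiceprint vectors; the patent does not specify the matching metric:

```python
import math
from typing import Dict, List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    # Standard cosine similarity between two voiceprint vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def identify_speaker(utterance_voiceprint: List[float],
                     mapping: List[dict]) -> Tuple[str, Dict[str, str]]:
    """Pick the participant whose registered voiceprint best matches the
    utterance as the speaker; every other participant is a listener, and
    their target languages are collected for real-time translation."""
    speaker = max(mapping,
                  key=lambda m: cosine_similarity(utterance_voiceprint,
                                                  m["voiceprint"]))
    listener_langs = {m["identity"]: m["target_language"]
                      for m in mapping if m is not speaker}
    return speaker["identity"], listener_langs
```

A production system would threshold the similarity score to reject voices that match no registered participant.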
  • the hub device has built-in functions such as speech recognition (ASR, Automatic Speech Recognition), speech synthesis (TTS, Text-To-Speech), voiceprint recognition, translation, and recording (supporting online or offline mode); it has networking and communication functions and can interact with the control terminal and the integrated terminals.
  • the hub device combines voiceprint recognition, ASR technology, machine translation technology, and TTS technology to build a simultaneous interpretation system for interview scenarios, removing communication barriers between different languages.
  • the hub device uses voiceprint recognition technology to identify, from the participants' voiceprint information in the preset mapping relationship, the target voiceprint information matching the voice data to be simultaneously interpreted, and adopts machine translation technology to translate that voice data into each listener's target language in real time, obtaining translation information that each listener can understand.
  • after the hub device translates the translation information required by each listener in real time, it can send the translation information in real time to the receiving terminals of the listeners participating in the voice interview, so that each listener hears the speech in a familiar language through his or her receiving terminal, that is, as translated voice data.
  • when the hub device communicates with each integrated terminal, the correspondence between each integrated terminal and its participant is stored, so the hub device can accurately send the data intended for a participant to the corresponding integrated terminal. In this way, the hub device can send the translation information obtained for each listener's target language to that listener's receiving terminal.
  • the translation information includes translated text information and translated voice data.
  • that is, after translating the voice data to be simultaneously interpreted into translated text information, the hub device uses TTS technology to convert the translated text information into translated voice data. In this way, the hub device can send both the translated voice data and the translated text information in real time to the receiving terminals of the listeners participating in the voice interview.
  • the translated text information may also be displayed on the display for the listener to watch.
  • the specific implementation of the embodiment of the present application is not limited.
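The chain of engines described above (ASR, machine translation per listener, TTS) can be sketched end to end. The four engine functions below are placeholder stubs, since the patent does not name concrete models; only the orchestration logic is the point:

```python
from typing import Dict, Tuple

def asr(audio: bytes) -> str:
    # Stub for speech recognition: voice data -> source text information.
    return "hello everyone"

def translate(text: str, target_lang: str) -> str:
    # Stub for machine translation into one listener's target language.
    return f"[{target_lang}] {text}"

def tts(text: str) -> bytes:
    # Stub for speech synthesis: translated text -> translated voice data.
    return text.encode("utf-8")

def interpret_for_listeners(audio: bytes,
                            listener_langs: Dict[str, str]
                            ) -> Dict[str, Tuple[str, bytes]]:
    """Produce (translated text, translated voice data) per listener, the
    two forms of translation information the hub device sends to each
    receiving terminal."""
    source_text = asr(audio)
    results = {}
    for listener, lang in listener_langs.items():
        translated_text = translate(source_text, lang)
        results[listener] = (translated_text, tts(translated_text))
    return results
```

Running ASR once and translating per target language, as above, avoids recognizing the same utterance repeatedly for every listener.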
  • S103 Record the collection time, speaker identity information, and translation information corresponding to the voice data to be simultaneously translated to obtain one piece of recorded information, and then at the end of the voice interview, obtain at least one piece of recorded information.
  • after the hub device obtains the translation information corresponding to each listener, it records the collection time, speaker identity information, and translation information corresponding to the voice data to be simultaneously interpreted to obtain one piece of record fragment information for that speaker, and then continues to obtain the record fragment information of the next speaker; thus, at the end of the voice interview, the hub device has obtained at least one piece of record fragment information corresponding to the various speakers.
  • a piece of record fragment information may include the voice data to be simultaneously interpreted, speaker identity information, translation information, and collection time.
  • Each record fragment can use fields for data recording.
  • the hub device can perform text recognition on the voice data to be simultaneously interpreted to obtain source text information, and record the collection time, speaker identity information, translation information, and source text information corresponding to the voice data until the target voiceprint information changes, at which point one piece of record fragment information is obtained; at the end of the voice interview, at least one piece of record fragment information has thus been obtained.
  • a piece of record segment information may also include: source text information.
  • the hub device adopts ASR technology to convert the voice data to be simultaneously interpreted into source text information.
  • the hub device recognizes the speaker's identity from the real-time voice data using voiceprint recognition technology; when the identity of the speaker changes at some moment, it means the previous speaker has finished speaking and the next speaker has begun, so the flow returns to S101 and the information corresponding to the previous speaker is recorded as one piece of record fragment information; in this way, at the end of the interview, the hub device has obtained at least one piece of record fragment information.
  • At least one recorded segment information may include recorded segment information of the same speaker at different moments, which is based on actual recording conditions and is not limited in the embodiment of the present application.
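The speaker-change rule above (close the current record fragment whenever the identified speaker changes, so the same speaker may yield multiple fragments at different moments) can be sketched as follows; the frame format is an assumption made for illustration:

```python
from typing import List, Tuple

def segment_by_speaker(frames: List[Tuple[float, str, str]]) -> List[dict]:
    """Group a stream of (timestamp, identified speaker, recognized text)
    frames into record fragments, closing the current fragment whenever
    the identified speaker changes."""
    fragments: List[dict] = []
    for timestamp, speaker, text in frames:
        if fragments and fragments[-1]["speaker"] == speaker:
            # Same speaker as the open fragment: extend it.
            fragments[-1]["source_text"] += " " + text
        else:
            # Speaker changed: start a new record fragment whose
            # collection time is the first timestamp of the utterance.
            fragments.append({"speaker": speaker,
                              "collection_time": timestamp,
                              "source_text": text})
    return fragments
```

If the same speaker talks again after someone else, a new fragment is opened, which matches the observation that one speaker can produce record fragments at different moments.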
  • the speaker's name is the speaker's identity information
  • the time stamp is the collection time
  • the audio is the voice data to be interpreted simultaneously
  • the speech text is the source text information
  • the translated text is the translation information.
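The field layout of one record fragment listed above (name, time stamp, audio, speech text, translated text) can be sketched as a simple structure. The field names below are illustrative, not taken from the patent's implementation:

```python
from dataclasses import dataclass

@dataclass
class RecordFragment:
    speaker_name: str     # the speaker's identity information
    timestamp: str        # the collection time, e.g. "2019.08.31 22:10:15"
    audio_path: str       # the voice data to be interpreted
    source_text: str      # the speech text (ASR output)
    translated_text: str  # the translation information
```

One such object would be created each time the target voiceprint changes, and the list of all objects at the end of the interview is the "at least one piece of record fragment information".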
  • S104 Generate an interview record based on the information of at least one record segment.
  • the central device has recorded at least one piece of record fragment information, so that it can generate an interview record for the interview based on the at least one piece of record fragment information.
  • the hub device can communicate with the control terminal, which is used to receive some conventional input settings, such as the target language, the number of participants, and the listening language of each integrated terminal.
  • the control terminal can also control functions of the hub device, for example the function of generating interview records.
  • the hub device may receive an interview generation instruction sent by the control terminal; in response to the interview generation instruction, generate interview records for at least one piece of record information in the order of the time axis; the hub device sends the interview records to the control terminal.
  • control terminal side can be provided with an input device.
  • the user can generate an interview generation instruction through the input device and send it to the central device; since the central device has recorded at least one piece of record fragment information for the voice interview, it generates an interview record from the at least one piece of record fragment information and then sends the interview record to the control terminal for presentation.
  • in the voice interview scene, the central device can determine the speaker identity information from the speaker's voice data and obtain translation information in the language each listener needs, and generate the interview record for the interview based on this information. The central device can thus perform real-time simultaneous interpretation of the voice data while also recording the identified speaker identity information, translation information, and other structured data as record fragment information; at the end of the interview, the multiple pieces of record fragment information are used to generate the interview record of the voice interview. This improves the efficiency of data collation in the voice interview, that is, the generation speed and processing efficiency of the interview record.
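Generating the interview record from the accumulated fragments amounts to sorting them along the time axis and rendering each one. A minimal sketch, assuming each fragment is a plain dict with illustrative keys `time`, `speaker`, `source_text`, and `translated_text`:

```python
def generate_interview_record(fragments):
    """Arrange record fragments in time-axis order and render a plain-text
    interview record. Field names are illustrative."""
    lines = []
    for f in sorted(fragments, key=lambda f: f["time"]):
        lines.append(f"{f['speaker']} ({f['time']})")
        lines.append(f"  original: {f['source_text']}")
        lines.append(f"  translation: {f['translated_text']}")
    return "\n".join(lines)
```

The sort on collection time is what produces the time-axis ordering described for the interview record.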
  • an embodiment of the present application further provides a voice information processing method, including S105-S106, as follows:
  • S105: Use abstract extraction technology to perform summary extraction on the at least one piece of record fragment information to obtain full-text summary information.
  • S106: Generate the interview record based on the at least one piece of record fragment information and the full-text summary information.
  • the hub device recorded at least one piece of record information.
  • the hub device can also use abstract extraction technology (such as the TextRank algorithm) to extract at least one piece of record information to extract the full text summary information.
  • the summary information represents the summary of the main content discussed by the speakers in this voice interview, that is, the full-text summary information is the full-text summary extracted after summarizing all the speeches of each speaker.
  • the central device can generate an interview record for this interview based on at least one piece of record information and full-text summary information.
  • the full text summary information can be placed at the beginning of the interview record to facilitate the user to roughly understand the content of the interview and decide whether to continue reading the interview record.
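The document names TextRank as one possible abstract extraction technology. As a stand-in, the sketch below uses a much simpler word-frequency heuristic (not TextRank itself, and not the patent's implementation): it scores each sentence by the corpus frequency of its words and keeps the top-k sentences in their original order.

```python
import re
from collections import Counter

def extract_summary(sentences, k=1):
    """Simplified extractive summarizer standing in for TextRank: score each
    sentence by the total frequency of the words it contains and return the
    top-k sentences in original order."""
    words = Counter(w for s in sentences for w in re.findall(r"\w+", s.lower()))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -sum(words[w] for w in re.findall(r"\w+", sentences[i].lower())),
    )
    keep = sorted(ranked[:k])
    return [sentences[i] for i in keep]
```

Applied to all speeches it yields the full-text summary information; applied to a single fragment it yields the speaker summary information discussed below.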
  • after the recording in S103 of the collection time, speaker identity information, and translation information corresponding to the voice data to be interpreted to obtain one piece of record fragment information, the method further includes S107-S109, as follows:
  • the hub device can use abstract extraction technology to perform summary extraction on one piece of record fragment information to extract speaker summary information; that is, after each record fragment is generated, its speaker summary information can be extracted at the same time. In this way, at the end of the voice interview, the hub device has obtained at least one piece of record fragment information and at least one piece of speaker summary information; finally, the hub device can generate the interview record based on the at least one piece of record fragment information and the at least one piece of speaker summary information.
  • the hub device may also perform summary extraction on each recorded segment information after acquiring at least one recorded segment information to obtain at least one speaker summary information, which is not limited in the embodiment of the present application.
  • the central device can summarize the central idea of each recorded piece of information into speaker summary information, so that the reading efficiency of each spoken information in the generated interview record can be improved.
  • the generation of interview records can also be based on full-text summary information, at least one speaker summary information, and at least one piece of record information at the same time.
  • the generated interview record then includes not only the content summary of the full text but also the content summary of each piece of speech information, which greatly improves the presentation of the main ideas in the interview record and thereby increases its richness and diversity.
  • the hub device may also be provided with a loudspeaker or connected to a loudspeaker.
  • the control terminal can set the playback language of the loudspeaker, and a participant's speech can be converted into that playback language and played through the loudspeaker.
  • FIG. 5 is a first schematic flowchart of a voice information processing method provided by an embodiment of this application. As shown in Figure 5, when applied to a control terminal, the voice information processing method includes the following steps:
  • S201 Receive participant identity information, target language, and participant's voiceprint information.
  • S202 Send the preset mapping relationship formed by the participant's identity information, the target language, and the participant's voiceprint information to the central device.
  • the control terminal may be a smart device installed with a designated application (such as a simultaneous interpretation application that implements the voice processing method provided in this application), such as a smart phone, a tablet computer, or a computer, which is not limited in the embodiment of this application.
  • the control terminal can communicate with the central device.
  • the control terminal side can be provided with an input device through which the user can input some general settings, such as the target language, the number of participants, and the listening language of each earphone/microphone terminal.
  • before the voice interview is conducted, the user can enter the relevant information of each participant through the control terminal, for example the participant identity information and the target language; at the same time, the participant's voiceprint can be collected by recording through the integrated terminal.
  • after the participant's voiceprint information is sent to the control terminal through the central device, the control terminal associates the participant identity information, the participant's voiceprint information, and the target language to form the preset mapping relationship.
  • the control terminal can then send the preset mapping relationship to the central device for its use.
  • alternatively, the control terminal may send the participant identity information and the target language to the central device while the integrated terminal sends the participant's voiceprint information to the central device, and the central device associates the participant identity information, the participant's voiceprint information, and the target language to form the preset mapping relationship. The process of obtaining the preset mapping relationship can therefore be selected according to actual conditions and is not limited in the embodiments of the present application.
  • the participant's identity information may be the participant's name, or a unique identification, etc., which is not limited in the embodiment of the present application.
  • the preset mapping relationship is the correspondence between the participant identity information, the target language, and the participant's voiceprint information; it may be expressed as a correspondence between participant identity information and target language, a correspondence between participant identity information and participant voiceprint information, a correspondence between target language and participant voiceprint information, or as a participant voiceprint information library, a participant identity information library, and a target language library, etc.
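One simple way to hold the three-way correspondence is a per-participant table, so that the target language or voiceprint can be looked up from the identity information. This layout and its field names are illustrative assumptions, not the patent's data model:

```python
# Hypothetical preset mapping relationship: identity -> {target language, voiceprint}.
preset_mapping = {
    "Speaker A": {"target_language": "zh", "voiceprint": [0.12, 0.80, 0.31]},
    "Speaker B": {"target_language": "en", "voiceprint": [0.77, 0.05, 0.44]},
}

def target_language_of(name):
    """Look up a listener's target language from their identity information."""
    return preset_mapping[name]["target_language"]
```

Equivalently, the same table could be split into the separate correspondence libraries the text mentions; the lookup logic is unchanged.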
  • at the end of the interview, an interview trigger instruction is received, and an interview generation instruction is generated in response to the interview trigger instruction and sent to the central device.
  • S205: Receive the interview record fed back by the central device in response to the interview generation instruction.
  • the interview record is generated by the central device in response to the interview generation instruction, based on the preset mapping relationship and real-time received voice data for simultaneous interpretation.
  • the control terminal can control the central device to generate interview records.
  • the user can trigger the interview record generation function through the control terminal, which generates an interview trigger instruction and, in response, an interview generation instruction that is sent to the central device, so that the central device can record the determined speaker identity information, translation information, and other structured data as record fragment information while performing real-time simultaneous interpretation of the voice data to be interpreted.
  • the central device generates the interview record from the at least one piece of record fragment information and then sends the interview record to the control terminal for presentation.
  • the interview record is generated by the central device in response to the interview generation instruction, based on the preset mapping relationship and the voice data to be interpreted received in real time.
  • control terminal can obtain the interview record from the central device.
  • the interview record is the information that records the content of the voice interview. In this way, the user can obtain or watch the interview record through the control terminal, which is convenient and fast.
  • in this way, the intelligence of the control terminal is improved.
  • after S205, the method further includes: S206; or S207-S208; or S209-S211, as follows:
  • after the control terminal obtains the interview record, because the interview record contains information related to each spokesperson's speech in the voice interview, and the order of the speeches is temporal, the content of the interview record is also temporal; the control terminal can therefore display the interview record in the order of the time axis.
  • each segment of the interview record in the interview record may include: speaker identity information, collection time, voice data to be simultaneously translated, and translation information.
  • the control terminal can arrange each segment of the interview record in the order of the time axis according to the collection time, and display the speaker identity information, the voice data to be interpreted, and the translation information corresponding to each arranged segment.
  • the voice data and translation information corresponding to each arranged segment of the interview record can be arranged, via function buttons, in the area corresponding to the speaker's identity information.
  • when a function button is triggered, the content corresponding to that function button is displayed.
  • the voice data to be interpreted corresponds to text information, that is, the source text information; the translation information may also include translated text information and translated voice data.
  • for convenience of display, a function button may be set so that one button corresponds to one type of content, or two buttons may be triggered jointly to combine multiple types of content.
  • source text information corresponds to a function button 1
  • translated text information corresponds to a function button 2
  • voice data corresponds to a function button 3.
  • when button 3 is triggered, the voice data to be interpreted is played; when buttons 2 and 3 are triggered together, the translated voice data is played; when button 1 is triggered, the source text information is displayed; when button 2 is triggered, the translated text information is displayed.
  • a comparison button can also be set to display the source text information and the translated text information simultaneously, etc.
  • the embodiments of this application do not limit the setting method and arrangement of the buttons based on the content displayed in the interview record.
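The button-to-content dispatch described above can be sketched as follows. The set-based interface and the fragment field names are assumptions for illustration; the mapping of button combinations follows the example in the text (1 = source text, 2 = translated text, 3 = audio):

```python
def on_buttons(pressed, fragment):
    """Dispatch a set of triggered function-button ids to an action.
    `fragment` is a dict with illustrative keys for one interview-record segment."""
    if pressed == {3}:
        return ("play", fragment["audio"])             # play voice data to be interpreted
    if pressed == {2, 3}:
        return ("play", fragment["translated_audio"])  # play translated voice data
    if pressed == {1}:
        return ("show", fragment["source_text"])       # display source text information
    if pressed == {2}:
        return ("show", fragment["translated_text"])   # display translated text information
    if pressed == {1, 2}:
        # a "contrast" view: source and translation side by side
        return ("show", (fragment["source_text"], fragment["translated_text"]))
    raise ValueError("unsupported button combination")
```

A dedicated comparison button could equally map to the `{1, 2}` case.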
  • the speaker’s language is English and the translation language is Chinese.
  • the control terminal will obtain Xxx interview records.
  • the display interface is equipped with four function buttons: “original”, “translation”, “sound”, and “contrast”.
  • the speakers include: Speaker A, Speaker B, and Speaker C. There are four function buttons, "original", "translation", "sound", and "contrast", in the corresponding area 1 behind each speaker's identity information.
  • the collection time is set (for example, Speaker A: 2019.08.31 22:10:15; Speaker B: 2019.08.31 22:12:10; Speaker C: 2019.08.31 22:13:08), that is, the start time of each speech, and the segments of the interview record are arranged according to the speech time for display.
  • the translated text information is displayed in area 3: "I think China will be far ahead in the 5G competition! In the next one to two years, 5G will begin to be applied and achieve explosive growth".
  • the source text information is displayed in area 4: "I agree with you very much, I think 5G will bring a lot of new opportunities”.
  • the translated voice data is played in area 5.
  • the source text information is displayed in area 6: "I believe that China will lead in the 5G competition! In the next one to two years, 5G will begin to apply and achieve explosive growth” Contrast with the translated text message: "I think China will be far ahead in the 5G race! In the next one to two years, 5G will begin to be applied and achieve explosive growth”.
  • the interview record of each speaker's speech has a time stamp, and you can choose the original text, translation, audio, and comparison options.
  • each segment of the interview record may also include speaker summary information; after arranging and displaying the speaker identity information, the voice data to be interpreted, and the translation information corresponding to each segment, the control terminal may also display the speaker summary information in a first preset area within the display area of each segment.
  • if a speaker's speech fragment is relatively long (for example, more than 70 words), the central device generates a speech fragment summary (that is, speaker summary information) and carries the speaker summary information in the interview record sent to the control terminal, so that readers can read the speaker summary information on the control terminal and decide whether to read the specific speech information.
  • the central device can summarize the central idea of each recorded piece of information into speaker summary information, so that the speaker summary information displayed by the control terminal can allow readers to quickly understand the main content of each interview record, thereby Improve the reading efficiency of each statement in the generated interview records.
  • the translated text information is: "I think China will be far ahead in the 5G competition! In the next one to two years, 5G will begin to be applied and achieve explosive growth. In the previous 2G/3G/4G eras, China was always in a passive position, which made foreigners think that China could not develop 5G first. Unexpectedly, China has become a leader in the 5G era."
  • the extracted speaker summary was "A believes that China's 5G is in a leading position and has achieved large-scale growth", so the control terminal can display the speaker summary in area 11 of the interview record display interface.
  • the speaker summary of each speaker can be displayed together with each segment of the interview record, or it can be displayed in area E through user manipulation on the display interface of the interview record.
  • the specific implementation is not limited in the embodiments of the present application.
  • the interview record also includes full-text summary information; when displaying the speaker identity information, the voice data to be interpreted, and the translation information corresponding to each arranged segment of the interview record, the control terminal can also display the full-text summary information in front of them.
  • the central device can use abstract extraction technology (such as the TextRank algorithm) to abstract at least one recorded segment information to extract full-text summary information, where the full-text summary information represents the main content of the speakers in the voice interview Summary, that is, the full-text summary information is the full-text summary extracted after summarizing all the speeches of each speaker.
  • the central device can generate an interview record for this interview based on at least one recorded fragment information and full-text summary information, and send it to the control terminal.
  • the full text summary information can be placed at the beginning of the interview record to facilitate the user to roughly understand the content of the interview and decide whether to continue reading the interview record.
  • after the control terminal receives the interview record, the machine-processed interview record may still contain errors or need manual additions and polishing, so the control terminal also provides an editing function.
  • the control terminal receives an editing instruction; in the display interface of the interview record it responds to the editing instruction, obtains the editing information, edits the interview record according to the obtained editing information, and obtains and displays the final interview record.
  • the editing function improves the content of the interview record and eliminates some errors, so the final interview record is more accurate and complete.
  • after the control terminal receives the interview record, there may be a need to share it, so the control terminal also provides a sharing function. When the user triggers the sharing function, the control terminal receives an export instruction; in the display interface of the interview record it responds to the export instruction, obtains the export format, processes the interview record into the preset format according to the obtained export format to obtain an export file, and finally shares the export file.
  • the export format may include: HTML format, txt format, PDF format, etc., any text format or web page format, etc., and the embodiment of the application does not limit it.
  • when the control terminal exports the interview record, it can export different formats for different purposes and platforms, such as exporting plain text in txt format, exporting to PDF format for archiving, or exporting to HTML format for sharing.
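The format dispatch for exporting can be sketched as below. Only the txt and a minimal HTML wrapper are shown; a real PDF export would typically go through an external library, so it is left out of this illustrative sketch:

```python
def export_record(record_text, fmt):
    """Render the interview record text in an export format ("txt" or "html").
    The HTML wrapper here is deliberately minimal."""
    if fmt == "txt":
        return record_text
    if fmt == "html":
        body = record_text.replace("\n", "<br>\n")
        return f"<html><body>{body}</body></html>"
    raise ValueError(f"unsupported export format: {fmt}")
```

The exported file could then carry an attached link to the stored voice data, as described next, so that text and audio are shared together.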
  • the voice data related to the interview can also be stored in the central device or in the cloud, so that when the control terminal exports the interview record it can also attach a link for sharing the voice, combining the interview record with the voice.
  • the link can be shared with others, so that others can obtain both the text information and the voice information of the interview record.
  • the sharing function provides the function of sharing interview records, which improves the intelligence and diversity in the voice interview scene.
  • the voice information processing method provided in this application solves the problem of interview record generation, greatly improves the efficiency of post-processing reports, and provides convenient functions such as original text, translation, comparison, audio, and abstract.
  • the embodiment of the present application provides a voice information processing method, as shown in FIG. 11, including:
  • the control terminal receives the participant's identity information, the target language, and the participant's voiceprint information.
  • the control terminal sends the preset mapping relationship formed by the participant's identity information, the target language, and the participant's voiceprint information to the central device.
  • the central device receives, from the collection terminal, the speaker's voice data to be interpreted, and obtains the collection time of the voice data to be interpreted.
  • S304: Determine the speaker identity information based on the voice data to be interpreted and the preset mapping relationship, and translate the voice data to be interpreted into the listener's target language in real time to obtain the translation information;
  • the preset mapping relationship is the correspondence between the participant identity information, the target language, and the participant's voiceprint information; the listener is a participant other than the speaker.
  • the central device sends the translation information to the receiving terminal in real time, so that the receiving terminal can play the translation information.
  • the hub device records the collection time, speaker identity information, and translation information corresponding to the voice data to be simultaneously translated to obtain a piece of recorded information, and then at the end of the voice interview, obtain at least one piece of recorded information.
  • the central device receives the interview generation instruction sent by the control terminal; in response to the interview generation instruction, generates an interview record for at least one piece of record information in the order of the time axis.
  • the central device sends the interview record to the control terminal.
  • the control terminal displays the interview record.
  • the participant's name (participant identity information) and language (target language) are entered through the control terminal, the participant's voiceprint information is collected through the headset/microphone integrated terminal, and the central device records the participant identity information and target language,
  • so that the preset mapping relationship is obtained. The simultaneous interpretation module in the central device starts working at the beginning of the interview; the speaker's voice data to be interpreted is recorded through the headset/microphone integrated terminal and sent to the central device in real time, and the central device completes transcription, translation, recording, and translated-voice generation according to the preset mapping relationship, obtains the record fragment information, and then sends the translated voice to the headset/microphone integrated terminals of the other participants (the listeners), who can hear the corresponding translated audio.
  • at the end of the interview, the central device generates the interview record from the multiple pieces of record fragment information, and the interview record can finally be edited through the control terminal to obtain and display the final interview record.
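The per-utterance loop of the flow above (identify the speaker, transcribe, translate, deliver to listeners, accumulate fragments) can be sketched end to end. The component functions are passed in as parameters because the real voiceprint, ASR, and translation engines are outside the scope of this illustrative sketch:

```python
def run_interview(audio_stream, identify, transcribe, translate, deliver):
    """End-to-end sketch of the interpretation loop: for each (time, audio)
    utterance, identify the speaker, transcribe and translate the speech,
    deliver the translation to listeners, and accumulate record fragments."""
    fragments = []
    for t, audio in audio_stream:
        speaker = identify(audio)        # voiceprint matching
        text = transcribe(audio)         # ASR -> source text information
        translation = translate(text)    # machine translation
        deliver(speaker, translation)    # listeners other than the speaker hear it
        fragments.append({"time": t, "speaker": speaker,
                          "source_text": text, "translated_text": translation})
    return fragments
```

The returned fragment list is the input from which the interview record is generated at the end of the interview.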
  • an embodiment of the present application provides a hub device 1, and the hub device 1 may include:
  • the first receiving unit 10 is configured to receive the voice data for simultaneous interpretation of the speaker transmitted by the collection terminal after the voice interview starts, and obtain the collection time when the voice data for simultaneous interpretation starts to be collected;
  • the determining unit 11 is configured to determine the identity information of the speaker based on the voice data to be simultaneously transmitted and the preset mapping relationship;
  • the translation unit 12 is used for real-time translation of the voice data to be simultaneously interpreted into the target language of the listener to obtain translation information;
  • the preset mapping relationship is among the identity information of the participant, the target language, and the voiceprint information of the participant Correspondence between; wherein, the listener is a person other than the speaker among the participants;
  • the recording unit 13 is used to record the collection time, the speaker identity information, and the translation information corresponding to the voice data to be interpreted to obtain one piece of record fragment information, and then obtain at least one piece of record fragment information at the end of the voice interview;
  • the first generating unit 14 is configured to generate an interview record based on the at least one record segment information.
  • the determining unit 11 is further configured to determine, from the participant voiceprint information in the preset mapping relationship, the target voiceprint information that matches the voiceprint of the voice data to be interpreted; to determine, based on the target voiceprint information and the correspondence between participant identity information and participant voiceprint information in the preset mapping relationship, the speaker identity information corresponding to the voice data to be interpreted; and to obtain, based on the correspondence between participant identity information and target language in the preset mapping relationship, the listener's target language corresponding to each listener;
  • the translation unit 12 is also used to translate the voice data to be interpreted into the listener's target language in real time to obtain the translation information.
  • the recording unit 13 is further configured to perform text recognition on the voice data to be interpreted to obtain source text information, and to record the collection time, the speaker identity information, the translation information, and the source text information corresponding to the voice data to be interpreted until the target voiceprint information changes, thereby obtaining one piece of record fragment information, and then obtaining the at least one piece of record fragment information at the end of the voice interview.
  • the hub device 1 further includes: an extracting unit 15;
  • the extraction unit 15 is configured to, after the collection time, speaker identity information, and translation information corresponding to the voice data to be interpreted have been recorded to obtain one piece of record fragment information and at least one piece of record fragment information has been obtained at the end of the voice interview, use abstract extraction technology to perform summary extraction on the at least one piece of record fragment information to extract the full-text summary information;
  • the first generating unit 14 is further configured to generate the interview record based on the at least one record fragment information and the full-text summary information.
  • the hub device 1 further includes: an extracting unit 15;
  • the extraction unit 15 is configured to, after the collection time, speaker identity information, and translation information corresponding to the voice data to be interpreted have been recorded to obtain one piece of record fragment information, use abstract extraction technology to perform summary extraction on the one piece of record fragment information to extract speaker summary information, and to obtain, at the end of the voice interview, the at least one piece of record fragment information and the at least one piece of speaker summary information.
  • the first generating unit 14 is further configured to generate the interview record based on the at least one record fragment information and the at least one speaker summary information.
  • the hub device 1 further includes: a first sending unit 16;
  • the first receiving unit 10 is also configured to receive an interview generation instruction sent by the control terminal;
  • the first generating unit 14 is further configured to generate the interview record for the at least one piece of record information in the order of the time axis in response to the interview generating instruction;
  • the first sending unit 16 is configured to send the interview record to the control terminal.
  • according to the speaker's voice data to be interpreted in the voice interview scene, the central device can determine the speaker identity information and obtain translation information in the language each listener needs, and at the end of the interview it can generate the interview record for the interview based on this information. The central device can thus perform real-time simultaneous interpretation of the voice data while also recording the identified speaker identity information, translation information, and other structured data as record fragment information; at the end of the interview, the multiple pieces of record fragment information are used to generate the interview record of the voice interview. This improves the efficiency of data collation in the voice interview, that is, the generation speed and processing efficiency of the interview record.
  • control terminal 2 may include:
  • the second receiving unit 20 is configured to receive participant identity information, target language, and participant's voiceprint information
  • the mapping unit 21 is configured to send the preset mapping relationship formed by the participant's identity information, the target language, and the participant's voiceprint information to the central device;
  • the second receiving unit 20 is further configured to receive an interview trigger instruction at the end of the interview;
  • the second generating unit 22 is configured to generate an interview generation instruction in response to the interview trigger instruction;
  • the second sending unit 23 is configured to send the interview generation instruction to the central device;
  • the second receiving unit 20 is configured to receive the interview record fed back by the central device in response to the interview generation instruction, the interview record being generated by the central device, in response to the interview generation instruction, based on the preset mapping relationship and the voice data to be simultaneously interpreted received in real time.
  • control terminal 2 further includes: a display unit 24;
  • the display unit 24 is configured to display the interview record in the order of the time axis after the interview record fed back by the central device in response to the interview generation instruction is received.
  • each segment of the interview record in the interview record includes: speaker identity information, collection time, voice data to be simultaneously translated, and translation information.
  • the display unit 24 is also used to arrange each segment of the interview record in the interview record in the order of the time axis according to the collection time; and to arrange each segment of the interview record corresponding to the The identity information of the speaker, the voice data to be simultaneously translated, and the translation information are displayed.
  • each segment of the interview record in the interview record further includes: speaker summary information
  • the display unit 24 is also used to, after displaying the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged segment of the interview record, display the speaker summary information in a first preset area of the display area of each segment of the interview record.
  • the interview record further includes: full-text summary information
  • the display unit 24 is also used to, when displaying the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged segment of the interview record, display the full-text summary information in front of the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged segment of the interview record.
  • control terminal 2 further includes: an editing unit 25 and a display unit 24;
  • the second receiving unit 20 is further configured to receive an editing instruction after receiving the interview record fed back by the central device in response to the interview generation instruction;
  • the editing unit 25 is configured to edit the interview record in response to the editing instruction to obtain the final interview record
  • the display unit 24 is used to display the final interview record.
  • control terminal 2 further includes: an export unit 26 and a sharing unit 27;
  • the second receiving unit 20 is configured to receive an export instruction after receiving the interview record fed back by the central device in response to the interview generation instruction;
  • the export unit 26 is configured to respond to the export instruction and process the interview record in a preset format to obtain an export file;
  • the sharing unit 27 is configured to share the exported file.
  • control terminal can obtain the interview record from the central device.
  • the interview record is the information that records the content of the voice interview. In this way, the user can obtain or watch the interview record through the control terminal, which is convenient and fast.
  • In this way, a control terminal is provided, which improves the intelligence of processing voice interview records.
  • an embodiment of the present application provides a hub device, including:
  • the first processor 17 is configured to execute the simultaneous interpretation program stored in the first memory 18 to implement the voice information processing method on the central device side.
  • an embodiment of the present application provides a control terminal, including:
  • the second processor 28 is configured to execute the simultaneous interpretation program stored in the second memory 29 to implement the voice information processing method on the control terminal side.
  • the above-mentioned first processor 17 or second processor 28 may be at least one of: an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a CPU, a controller, a microcontroller, or a microprocessor.
  • the electronic device used to implement the functions of the first processor 17 or the second processor 28 may also be other, which is not limited in the embodiment of the present disclosure.
  • the hub device further includes a first memory 18, and the control terminal further includes a second memory 29.
  • the first memory 18 can be connected to the first processor 17, and the second memory 29 can be connected to the second processor 28.
  • the first memory 18 or the second memory 29 may include a high-speed RAM memory, or may also include a non-volatile memory, for example, at least two disk memories.
  • the first memory 18 or the second memory 29 may be a volatile memory, such as a random-access memory (RAM); or a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above, and provides instructions and data to the first processor 17 or the second processor 28.
  • the functional modules in this embodiment may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be realized in the form of hardware or software function module.
  • if the integrated unit is implemented in the form of a software function module and is not sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of this embodiment, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the method in this embodiment.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, an optical disk, or other media that can store program code.
  • an embodiment of the present application also provides a computer-readable storage medium on which a simultaneous interpretation program is stored; when the program is executed by one or more first processors, the voice information processing method on the central device side is implemented.
  • an embodiment of the present application also provides a computer-readable storage medium on which a simultaneous interpretation program is stored; when the program is executed by one or more second processors, the voice information processing method on the control terminal side is implemented.
  • the computer-readable storage medium may be a volatile memory, such as a random-access memory (RAM); or a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); it may also be a device including one or any combination of the above-mentioned memories, such as a mobile phone, a computer, a tablet device, or a personal digital assistant.
  • this application can be provided as methods, systems, or computer program products. Therefore, this application may adopt the form of hardware embodiments, software embodiments, or embodiments combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, which implements the functions specified in one or more processes in the schematic flowchart and/or one or more blocks in the block diagram.
  • These computer program instructions can also be loaded on a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing; thus, the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes in the schematic flowchart and/or one or more blocks in the block diagram.
  • In summary, in a voice interview scene, the central device can determine the speaker identity information from the voice data of the speaker to be simultaneously interpreted and obtain translation information in the languages that the listeners need.
  • At the end of the interview, an interview record for the interview can be generated based on the above information, so that the central device, while performing real-time simultaneous interpretation of the voice data, also records the identified speaker identity information and translation information as structured record fragment information.
  • At the end of the interview, the multiple pieces of record fragment information obtained are used to generate the interview record of the voice interview. Therefore, the efficiency of data collation in the voice interview is improved, that is, the generation speed and processing efficiency of the interview record of the voice interview are improved.
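The pipeline summarized above — record fragments carrying collection time, speaker identity, source text, translation, and summaries, assembled into a time-ordered interview record — can be sketched as a minimal data model. All class and field names here are illustrative assumptions, not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RecordFragment:
    # One "record fragment information" item, as described in the text.
    collection_time: str       # time at which collection of this utterance began
    speaker_identity: str      # speaker identity information
    source_text: str           # text recognized from the original speech
    translation: str           # simultaneous-interpretation output
    speaker_summary: str = ""  # optional per-speaker summary information

@dataclass
class InterviewRecord:
    fragments: List[RecordFragment] = field(default_factory=list)
    full_text_summary: str = ""  # optional full-text summary information

    def ordered(self):
        # Arrange fragments on the time axis by collection time.
        return sorted(self.fragments, key=lambda f: f.collection_time)

rec = InterviewRecord([RecordFragment("09:01", "Bob", "hi", "你好"),
                       RecordFragment("09:00", "Alice", "hello", "大家好")])
print([f.speaker_identity for f in rec.ordered()])
```

Sorting on the collection-time string works here because the timestamps share a fixed zero-padded format; a real implementation would more likely store `datetime` values.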

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice information processing method, a hub device (1), a control terminal (2) and a storage medium. The method comprises: after a voice interview begins, receiving voice data, to be interpreted simultaneously, of a speaker transmitted by an acquisition terminal, and acquiring the acquisition time of starting to acquire said voice data (S101); determining identity information of the speaker on the basis of said voice data and a preset mapping relationship, and simultaneously interpreting said voice data into a target language of a listener in real time, so as to obtain interpretation information, the preset mapping relationship being a correlation between identity information of participants, target languages and voiceprint information of the participants, and the listener being a person other than the speaker among the participants (S102); recording said acquisition time, the identity information of the speaker and the interpretation information, so as to obtain one piece of record fragment information, and then acquiring at least one piece of record fragment information when the voice interview ends (S103); and generating, on the basis of the at least one piece of record fragment information, an interview record (S104).

Description

Voice information processing method, central device, control terminal and storage medium
Technical field
The embodiments of the present application relate to the field of voice processing technology, and in particular to a voice information processing method, a central device, a control terminal, and a storage medium.
Background
As the trend of economic globalization continues to deepen, exchanges between different countries and cultures have become more and more frequent.
In multi-person interview scenarios or meetings, participants may come from different countries and regions, so there may be barriers to communication between them. In addition, after the interview, translating and organizing the interview records consumes considerable manpower, which is inefficient and not conducive to the rapid release and dissemination of the interview content.
Summary of the invention
The embodiments of the present application provide a voice information processing method, a central device, a control terminal, and a storage medium, which can generate an interview record and improve the generation speed and processing efficiency of the interview record of a voice interview.
The technical solutions of the embodiments of the present application are implemented as follows:
An embodiment of the present application provides a voice information processing method, including:
after a voice interview starts, receiving the voice data to be simultaneously interpreted of a speaker transmitted by a collection terminal, and acquiring the collection time at which collection of the voice data to be simultaneously interpreted starts;
determining speaker identity information based on the voice data to be simultaneously interpreted and a preset mapping relationship, and simultaneously interpreting the voice data into the listener's target language in real time to obtain translation information; the preset mapping relationship is the correspondence between participant identity information, target languages, and voiceprint information of the participants; the listener is a person among the participants other than the speaker;
recording the collection time, the speaker identity information, and the translation information corresponding to the voice data to be simultaneously interpreted to obtain one piece of record fragment information, and then obtaining at least one piece of record fragment information when the voice interview ends;
generating an interview record based on the at least one piece of record fragment information.
In the above solution, the determining speaker identity information based on the voice data to be simultaneously interpreted and the preset mapping relationship, and simultaneously interpreting the voice data into the listener's target language in real time to obtain translation information, includes:
from the voiceprint information of the participants in the preset mapping relationship, determining target voiceprint information that matches the voiceprint of the voice data to be simultaneously interpreted;
based on the target voiceprint information and the correspondence between the participant identity information and the voiceprint information of the participants in the preset mapping relationship, determining the speaker identity information corresponding to the voiceprint of the voice data to be simultaneously interpreted, and based on the correspondence between the participant identity information and the target languages in the preset mapping relationship, obtaining the listener's target language corresponding to the listener;
translating the voice data to be simultaneously interpreted into the listener's target language in real time to obtain the translation information.
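The lookup chain above (voiceprint match → speaker identity → listeners' target languages) can be illustrated with a minimal sketch. The squared-distance voiceprint matching and all names are assumptions for illustration; the disclosure does not specify a voiceprint algorithm.

```python
from dataclasses import dataclass

@dataclass
class Participant:
    identity: str          # participant identity information
    target_language: str   # language this participant listens in
    voiceprint: tuple      # simplified voiceprint feature vector (assumed)

def match_speaker(mapping, sample_print):
    """Pick the participant whose stored voiceprint is closest to the sample."""
    def distance(p):
        return sum((a - b) ** 2 for a, b in zip(p.voiceprint, sample_print))
    return min(mapping, key=distance)

def listener_languages(mapping, speaker):
    """Target languages of every participant except the speaker."""
    return {p.target_language for p in mapping if p.identity != speaker.identity}

# Preset mapping relationship: identity ↔ target language ↔ voiceprint.
mapping = [
    Participant("Alice", "en", (0.9, 0.1)),
    Participant("Bob", "zh", (0.1, 0.8)),
]
speaker = match_speaker(mapping, (0.85, 0.15))
print(speaker.identity)                      # closest voiceprint → Alice
print(listener_languages(mapping, speaker))  # {'zh'}
```

The actual translation step would then target each language in `listener_languages`; a production system would use a trained speaker-verification model rather than raw vector distance.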
In the above solution, the recording of the collection time, the speaker identity information, and the translation information corresponding to the voice data to be simultaneously interpreted to obtain one piece of record fragment information, and then obtaining at least one piece of record fragment information when the voice interview ends, includes:
performing text recognition on the voice data to be simultaneously interpreted to obtain source text information;
recording the collection time, the speaker identity information, the translation information, and the source text information corresponding to the voice data to be simultaneously interpreted until the target voiceprint information changes, to obtain one piece of record fragment information, and then obtaining the at least one piece of record fragment information when the voice interview ends.
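The segmentation rule above — accumulate recognized text into one record fragment until the target voiceprint (i.e., the identified speaker) changes — can be sketched as follows. The utterance tuples and field names are illustrative assumptions.

```python
from itertools import groupby

# Each tuple: (collection time, identified speaker, recognized source text).
utterances = [
    ("09:00:01", "Alice", "Hello everyone"),
    ("09:00:05", "Alice", "Let's begin"),
    ("09:00:12", "Bob", "Thank you"),
]

def to_fragments(utts):
    """Merge consecutive same-speaker utterances into record fragments."""
    fragments = []
    for speaker, group in groupby(utts, key=lambda u: u[1]):
        group = list(group)
        fragments.append({
            "speaker": speaker,
            "collection_time": group[0][0],  # time the fragment started
            "source_text": " ".join(u[2] for u in group),
        })
    return fragments

fragments = to_fragments(utterances)
print(len(fragments))  # 2 fragments: Alice's turn, then Bob's
```

`itertools.groupby` groups only consecutive equal keys, which matches the "until the voiceprint changes" boundary: if Alice speaks again later, that starts a new fragment.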
In the above solution, after the recording of the collection time, the speaker identity information, and the translation information corresponding to the voice data to be simultaneously interpreted obtains one piece of record fragment information, and then at least one piece of record fragment information is obtained when the voice interview ends, the method further includes:
performing summary extraction on the at least one piece of record fragment information using an abstract extraction technology to extract full-text summary information;
the generating an interview record based on the at least one piece of record fragment information includes:
generating the interview record based on the at least one piece of record fragment information and the full-text summary information.
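The disclosure names only an unspecified "abstract extraction technology". As one plausible stand-in, a minimal word-frequency extractive summarizer might look like this; it is purely illustrative and not the patented method.

```python
import re
from collections import Counter

def extract_summary(text, max_sentences=1):
    """Return the highest-scoring sentences by summed word frequency."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w.lower() for w in re.findall(r"\w+", text))
    def score(sentence):
        return sum(freq[w.lower()] for w in re.findall(r"\w+", sentence))
    # Rank sentences and keep the top max_sentences as the summary.
    return sorted(sentences, key=score, reverse=True)[:max_sentences]

text = ("The product launch is in March. The launch team will travel. "
        "Budgets were approved.")
print(extract_summary(text))
```

Applied per fragment this yields the speaker summary information; applied over all fragments' text it yields the full-text summary information.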
In the above solution, after the recording of the collection time, the speaker identity information, and the translation information corresponding to the voice data to be simultaneously interpreted obtains one piece of record fragment information, the method further includes:
performing summary extraction on the piece of record fragment information using an abstract extraction technology to extract speaker summary information;
obtaining the at least one piece of record fragment information and at least one piece of speaker summary information when the voice interview ends.
In the above solution, the generating an interview record based on the at least one piece of record fragment information includes:
generating the interview record based on the at least one piece of record fragment information and the at least one piece of speaker summary information.
In the above solution, the generating an interview record based on the at least one piece of record fragment information includes:
receiving an interview generation instruction sent by a control terminal;
in response to the interview generation instruction, generating the interview record from the at least one piece of record fragment information in the order of the time axis;
after the generating an interview record based on the at least one piece of record fragment information, the method further includes:
sending the interview record to the control terminal.
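Generating the interview record "in the order of the time axis" amounts to a sort over the fragments' collection times followed by rendering; the field names and line format below are illustrative assumptions.

```python
# Fragments may arrive or be stored out of order; sort before rendering.
fragments = [
    {"collection_time": "09:00:12", "speaker": "Bob", "translation": "谢谢"},
    {"collection_time": "09:00:01", "speaker": "Alice", "translation": "大家好"},
]

def generate_interview_record(fragments):
    """Render fragments as time-ordered lines of the interview record."""
    ordered = sorted(fragments, key=lambda f: f["collection_time"])
    return [f"[{f['collection_time']}] {f['speaker']}: {f['translation']}"
            for f in ordered]

record = generate_interview_record(fragments)
print(record[0])  # [09:00:01] Alice: 大家好
```

The control terminal would receive this rendered record and display it segment by segment, as described below.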
An embodiment of the present application also provides a voice information processing method, including:
receiving participant identity information, target languages, and voiceprint information of the participants;
sending a preset mapping relationship formed by the participant identity information, the target languages, and the voiceprint information of the participants to a central device;
at the end of the interview, receiving an interview trigger instruction, and generating an interview generation instruction in response to the interview trigger instruction;
sending the interview generation instruction to the central device;
receiving an interview record fed back by the central device in response to the interview generation instruction, the interview record being generated by the central device, in response to the interview generation instruction, based on the preset mapping relationship and the voice data to be simultaneously interpreted received in real time.
In the above solution, after the receiving the interview record fed back by the central device in response to the interview generation instruction, the method further includes:
displaying the interview record in the order of the time axis.
In the above solution, each segment of the interview record includes: speaker identity information, collection time, voice data to be simultaneously interpreted, and translation information.
In the above solution, the displaying the interview record in the order of the time axis includes:
arranging each segment of the interview record in the order of the time axis according to the collection time;
displaying the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged segment of the interview record.
In the above solution, each segment of the interview record further includes: speaker summary information; after the displaying of the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged segment of the interview record, the method further includes:
displaying the speaker summary information in a first preset area of the display area of each segment of the interview record.
In the above solution, the interview record further includes: full-text summary information; when displaying the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged segment of the interview record, the method further includes:
displaying the full-text summary information in front of the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged segment of the interview record.
In the above solution, after the receiving the interview record fed back by the central device in response to the interview generation instruction, the method further includes:
receiving an editing instruction;
in response to the editing instruction, editing the interview record to obtain and display a final interview record.
In the above solution, after the receiving the interview record fed back by the central device in response to the interview generation instruction, the method further includes:
receiving an export instruction;
in response to the export instruction, processing the interview record in a preset format to obtain an export file;
sharing the export file.
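As a hedged sketch of the export step, the "preset format processing" could, for example, serialize the interview record to a JSON export file before sharing; the format and all names are assumptions, since the disclosure does not fix a format.

```python
import json
import os
import tempfile

def export_record(record, path):
    """Write the interview record to `path` as a JSON export file."""
    with open(path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-Latin translations readable in the file.
        json.dump({"interview_record": record}, fh, ensure_ascii=False, indent=2)
    return path

path = os.path.join(tempfile.mkdtemp(), "interview.json")
export_record(["[09:00:01] Alice: Hello"], path)
print(os.path.exists(path))  # True once the export file is written
```

The resulting file path is what the sharing step would then hand to whatever channel (e-mail, messaging, cloud link) the control terminal supports.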
An embodiment of the present application provides a central device, including:
a first receiving unit, configured to receive the voice data to be simultaneously interpreted of a speaker transmitted by a collection terminal after a voice interview starts, and acquire the collection time at which collection of the voice data to be simultaneously interpreted starts;
a determining unit, configured to determine speaker identity information based on the voice data to be simultaneously interpreted and a preset mapping relationship;
a translation unit, configured to simultaneously interpret the voice data to be simultaneously interpreted into the listener's target language in real time to obtain translation information; the preset mapping relationship is the correspondence between participant identity information, target languages, and voiceprint information of the participants; the listener is a person among the participants other than the speaker;
a recording unit, configured to record the collection time, the speaker identity information, and the translation information corresponding to the voice data to be simultaneously interpreted to obtain one piece of record fragment information, and then obtain at least one piece of record fragment information when the voice interview ends;
a first generating unit, configured to generate an interview record based on the at least one piece of record fragment information.
An embodiment of the present application provides a control terminal, including:
a second receiving unit, configured to receive participant identity information, target languages, and voiceprint information of the participants;
a mapping unit, configured to send a preset mapping relationship formed by the participant identity information, the target languages, and the voiceprint information of the participants to a central device;
the second receiving unit is further configured to receive an interview trigger instruction at the end of the interview;
a second generating unit, configured to generate an interview generation instruction in response to the interview trigger instruction;
a second sending unit, configured to send the interview generation instruction to the central device;
the second receiving unit is configured to receive an interview record fed back by the central device in response to the interview generation instruction, the interview record being generated by the central device, in response to the interview generation instruction, based on the preset mapping relationship and the voice data to be simultaneously interpreted received in real time.
An embodiment of the present application also provides a central device, including:
a first processor and a first memory;
the first processor is configured to execute a simultaneous interpretation program stored in the first memory to implement the voice information processing method on the central device side.
An embodiment of the present application also provides a control terminal, including:
a second processor and a second memory;
the second processor is configured to execute a simultaneous interpretation program stored in the second memory to implement the voice information processing method on the control terminal side.
An embodiment of the present application provides a storage medium on which a simultaneous interpretation program is stored; when the simultaneous interpretation program is executed by a first processor, the voice information processing method on the central device side is implemented; or, when the simultaneous interpretation program is executed by a second processor, the voice information processing method on the control terminal side is implemented.
The embodiments of the present application provide a voice information processing method, a central device, a control terminal, and a storage medium, including: after a voice interview starts, receiving the voice data to be simultaneously interpreted of a speaker transmitted by a collection terminal, and acquiring the collection time at which collection of the voice data starts; determining speaker identity information based on the voice data and a preset mapping relationship, and simultaneously interpreting the voice data into the listener's target language in real time to obtain translation information, the preset mapping relationship being the correspondence between participant identity information, target languages, and voiceprint information of the participants, and the listener being a person among the participants other than the speaker; recording the collection time, the speaker identity information, and the translation information corresponding to the voice data to obtain one piece of record fragment information, and obtaining at least one piece of record fragment information when the voice interview ends; and generating an interview record based on the at least one piece of record fragment information.
With the above technical solution, the central device can, in a voice interview scene, determine the speaker identity information from the speaker's voice data to be simultaneously interpreted and obtain translation information in the languages that the listeners need; at the end of the interview, an interview record for the interview can be generated based on the above information. In this way, while performing real-time simultaneous interpretation of the voice data, the central device also records the identified speaker identity information, translation information, and other structured data as record fragment information; at the end of the interview, the multiple pieces of record fragment information obtained are used to generate the interview record of the voice interview. Therefore, the efficiency of data collation in the voice interview is improved, that is, the generation speed and processing efficiency of the interview record are improved.
Description of the Drawings
FIG. 1 is an architecture diagram of a voice information processing system provided by an embodiment of this application;
FIG. 2 is a first schematic flowchart of a voice information processing method provided by an embodiment of this application;
FIG. 3 is a second schematic flowchart of a voice information processing method provided by an embodiment of this application;
FIG. 4 is a third schematic flowchart of a voice information processing method provided by an embodiment of this application;
FIG. 5 is a first schematic flowchart of another voice information processing method provided by an embodiment of this application;
FIG. 6 is a second schematic flowchart of another voice information processing method provided by an embodiment of this application;
FIG. 7 is a first schematic diagram of an exemplary display interface for an interview record provided by an embodiment of this application;
FIG. 8 is a second schematic diagram of an exemplary display interface for an interview record provided by an embodiment of this application;
FIG. 9 is a third schematic diagram of an exemplary display interface for an interview record provided by an embodiment of this application;
FIG. 10 is a fourth schematic diagram of an exemplary display interface for an interview record provided by an embodiment of this application;
FIG. 11 is an interaction diagram of a voice information processing method provided by an embodiment of this application;
FIG. 12 is a first schematic diagram of the composition structure of a hub device provided by an embodiment of this application;
FIG. 13 is a second schematic diagram of the composition structure of a hub device provided by an embodiment of this application;
FIG. 14 is a first schematic diagram of the composition structure of a control terminal provided by an embodiment of this application;
FIG. 15 is a second schematic diagram of the composition structure of a control terminal provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the accompanying drawings. It should be understood that the specific embodiments described here are only intended to explain the related application, not to limit it. It should also be noted that, for ease of description, only the parts related to the application are shown in the drawings.
An embodiment of the present application provides a voice information processing method implemented by a voice information processing apparatus. The voice information processing apparatus provided in the embodiment may include a hub device, a control terminal, and transceiver-integrated terminals (serving as collection terminals and receiving terminals).
FIG. 1 is a schematic architecture diagram of a voice information processing system to which the voice information processing method is applied. As shown in FIG. 1, the voice information processing system may include: a hub device 1, a control terminal 2, and multiple transceiver-integrated terminals 3 (including a collection terminal 3-1 and a receiving terminal 3-2).
In scenarios such as a multi-person interview or a multi-person conference, a speaker can give a presentation through the transceiver-integrated terminal 3 he or she wears (i.e., the collection terminal 3-1). During the presentation, the collection terminal 3-1 collects the speaker's voice data (i.e., the voice data to be interpreted) and transmits it to the hub device 1 in real time. When obtaining the voice data to be interpreted, the hub device 1 takes the time at which it starts receiving the voice data as the collection time, determines the identity of the speaker who is currently speaking based on the voice data to be interpreted, and at the same time simultaneously interprets the voice data into each listener's target language in real time to obtain translation information (each listener receives a translation result in the specific language he or she requires). The hub device transmits the translation information in real time to the integrated terminals (i.e., the receiving terminals 3-2) of the listeners attending the conference, and records, for each speaker, the voice data to be interpreted, the collection time, the speaker identity information, the translation information, and so on as record segment information corresponding to that speaker. In addition, at the end of the conference, the hub device 1 can receive an interview generation instruction sent by the control terminal 2, generate an interview record from all the record segment information obtained in the conference according to the instruction, and finally send the interview record to the control terminal 2, so that the control terminal 2 can display the interview record or share it with user terminals owned by the participants, allowing the participants to access or browse the conference record.
FIG. 2 is a first schematic flowchart of a voice information processing method provided by an embodiment of the application. As shown in FIG. 2, the voice information processing method, applied to a hub device, includes the following steps:
S101: After the voice interview starts, receive the speaker's voice data to be interpreted transmitted by the collection terminal, and obtain the collection time at which collection of the voice data to be interpreted started.
The voice information processing method provided by the embodiments of this application can be applied to international conferences, international interviews, or any scenario requiring simultaneous interpretation; the embodiments of this application impose no limitation.
It should be noted that in the embodiments of the present application, application scenarios can further be divided into large international conferences, small working meetings, public service venues, public social venues, social applications, general scenarios, and so on. Public service venues may be waiting halls, government office halls, etc.; public social venues may be coffee shops, concert halls, etc. The actual application scenario corresponding to the voice data to be interpreted is simply the scenario in which that voice data is collected; the specific scenario is not limited in the embodiments of this application.
In the embodiments of this application, the hub device communicates with the integrated terminals and the control terminal. An integrated terminal is a transceiver-integrated terminal worn by a participant in the voice interview, for example a combined earphone/microphone terminal. The integrated terminal worn by the current speaker may be called the collection terminal, while the integrated terminals worn by the other listeners may be called receiving terminals.
The communication mode may be a wireless communication technology, a wired communication technology, a near field communication technology, and so on, for example Bluetooth or Wi-Fi; the embodiments of this application impose no limitation.
It should be noted that an integrated terminal has both an earphone and a microphone: the microphone collects the voice data to be interpreted when the wearer speaks, and the earphone plays the translation information when the wearer listens. Therefore, each participant can be either a speaker or a listener, depending on the actual situation; the embodiments of this application impose no limitation.
After the voice interview starts, the speaker's collection terminal picks up the sound, collects the voice data to be interpreted, and transmits it to the hub device in real time; the hub device also records the time at which it starts receiving the voice data to be interpreted, i.e., the collection time.
It should be noted that each speaker's voice data to be interpreted is transmitted in real time, but the hub device only records the time at which each speaker starts speaking, i.e., the collection time. In the embodiments of this application, the voice data to be interpreted can be any speech that requires translation, for example speech collected in real time in an application scenario, and it can be speech in any language; the specific voice data to be interpreted is not limited in the embodiments of this application.
In the embodiments of this application, multiple speakers may speak after the interview starts. A speaker in this application is whichever participant is speaking at a given time; the embodiments of this application impose no limitation.
S102: Determine the speaker identity information based on the voice data to be interpreted and a preset mapping relationship, and simultaneously interpret the voice data into each listener's target language in real time to obtain translation information. The preset mapping relationship is the correspondence among participant identity information, target languages, and participants' voiceprint information; a listener is any participant other than the speaker.
After the hub device obtains the voice data to be interpreted, it can use the preset mapping relationship it stores, namely the correspondence among participant identity information, target languages, and participants' voiceprint information. Based on the participants' voiceprint information stored in the preset mapping relationship, the hub device first finds the target voiceprint information matching the voice data to be interpreted; then, based on the correspondence between participant identity information and voiceprint information, it finds the participant identity information corresponding to the target voiceprint information, that is, the speaker identity information, which at the same time determines the listeners' identity information; next, it determines each listener's target language from the correspondence between participant identity information and target languages; and finally it translates the voice data to be interpreted in real time into translation information in each listener's target language, so that each listener can hear the speaker's speech in a familiar language through the receiving terminal. A listener is any participant other than the speaker.
In the embodiments of this application, the speaker identity information may be the speaker's name or a unique identifier of the speaker; the embodiments of this application impose no limitation.
It should be noted that the preset mapping relationship, being the correspondence among participant identity information, target languages, and participants' voiceprint information, may include: the correspondence between participant identity information and target languages, the correspondence between participant identity information and participants' voiceprint information, the correspondence between target languages and participants' voiceprint information, as well as a participant voiceprint library, a participant identity library, and a target language library.
That is to say, in the embodiments of this application, the hub device determines, from the participants' voiceprint information in the preset mapping relationship, the target voiceprint information matching the voiceprint of the voice data to be interpreted; determines the speaker identity information corresponding to that voiceprint based on the target voiceprint information and the correspondence between participant identity information and voiceprint information in the preset mapping relationship; obtains each listener's target language based on the correspondence between participant identity information and target languages in the preset mapping relationship; and finally translates the voice data to be interpreted into each listener's target language in real time to obtain the translation information.
In some embodiments of this application, once the hub device knows the correspondence between target languages and participants' voiceprint information, then after the target voiceprint information is determined, every participant other than the speaker is a listener; the target languages corresponding to the non-matching voiceprint information are therefore the listeners' target languages, and each listener's target language can thus be determined.
For example, suppose the participants are A, B, and C, and A is speaking. Once A is determined to be the speaker, B and C are the listeners, so the target languages corresponding to listeners B and C can be found among the target languages of A, B, and C.
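The role resolution described above can be sketched as follows. This is a minimal illustration with hypothetical names and a plain dictionary standing in for the preset mapping relationship; the embodiment does not mandate any particular data structure.

```python
# Preset mapping relationship (illustrative): participant identity ->
# (voiceprint identifier, target language). In a real system the voiceprint
# entry would be an embedding matched by a voiceprint recognition model.
PRESET_MAPPING = {
    "A": {"voiceprint": "vp_a", "target_language": "zh"},
    "B": {"voiceprint": "vp_b", "target_language": "en"},
    "C": {"voiceprint": "vp_c", "target_language": "fr"},
}

def resolve_roles(matched_voiceprint):
    """Given the target voiceprint matched against the incoming audio, return
    (speaker identity, {listener identity: listener target language})."""
    speaker = next(
        name for name, info in PRESET_MAPPING.items()
        if info["voiceprint"] == matched_voiceprint
    )
    # Every participant other than the speaker is a listener.
    listeners = {
        name: info["target_language"]
        for name, info in PRESET_MAPPING.items()
        if name != speaker
    }
    return speaker, listeners
```

With the A/B/C example above, matching A's voiceprint yields speaker "A" and listener target languages for B and C.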
In some embodiments of this application, the hub device has built-in functions such as automatic speech recognition (ASR), text-to-speech synthesis (TTS), voiceprint recognition, translation, and recording (supporting both online and offline modes), and has networking and communication functions so that it can exchange data with the control terminal and the integrated terminals.
It can be understood that, by combining voiceprint recognition, ASR, machine translation, and TTS, the hub device builds a simultaneous interpretation system for interview scenarios, removing the communication barriers between different languages.
In the embodiments of this application, the hub device uses voiceprint recognition technology to identify, among the participants' voiceprint information in the preset mapping relationship, the target voiceprint information matching the voice data to be interpreted, and uses machine translation technology to simultaneously interpret the voice data into each listener's target language in real time, obtaining translation information that each listener can understand.
In some embodiments of this application, once the hub device has translated in real time the translation information each listener requires, it can send the translation information in real time to the receiving terminals of the listeners attending the voice interview, so that each listener can hear the speech in a familiar language through his or her receiving terminal, i.e., as translated voice data.
It should be noted that, when communicating with the integrated terminals, the hub device stores the correspondence between each integrated terminal and its participant, so the hub device can accurately deliver data intended for a participant to that participant's integrated terminal. In this way, the hub device can send the translation information obtained for each listener's target language to that listener's receiving terminal.
In some embodiments of this application, the translation information includes translated text information and translated voice data: the hub device first translates the voice data to be interpreted into translated text information, and then uses TTS technology to convert the translated text information into translated voice data. The hub device can thus send both the translated voice data and the translated text information in real time to the receiving terminals of the listeners attending the voice interview.
In the embodiments of this application, if an integrated terminal is provided with a display device, the translated text information can also be shown on the display for the listener to read; the specific implementation is not limited in the embodiments of this application.
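The per-utterance processing chain described above (ASR, then machine translation per target language, then TTS) can be sketched as follows. The `asr`, `translate`, and `tts` functions here are hypothetical stand-ins that return tagged strings; they are not a real speech or translation API, and only the orchestration is illustrated.

```python
def asr(audio):
    # Stand-in for a real speech recognizer producing source text.
    return f"text({audio})"

def translate(text, target_lang):
    # Stand-in for a machine translation engine.
    return f"{target_lang}:{text}"

def tts(text):
    # Stand-in for text-to-speech synthesis of the translated text.
    return f"audio[{text}]"

def interpret_utterance(audio, listener_langs):
    """Return translation information per target language: each entry holds
    the translated text and the synthesized translated voice data, as in the
    embodiment where both are sent to the receiving terminals."""
    source_text = asr(audio)
    out = {}
    for lang in set(listener_langs):  # translate once per distinct language
        translated = translate(source_text, lang)
        out[lang] = {"text": translated, "audio": tts(translated)}
    return out
```

Note that listeners sharing a target language share one translation, which mirrors the idea that translation is driven by the required languages rather than by individual listeners.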
S103: Record the collection time, speaker identity information, and translation information corresponding to the voice data to be interpreted to obtain one piece of record segment information, and thereby obtain at least one piece of record segment information by the end of the voice interview.
After obtaining the translation information for each listener, the hub device records the collection time, speaker identity information, and translation information corresponding to the voice data to be interpreted to obtain one piece of record segment information for the current speaker, and then proceeds to obtain record segment information for the next speaker. In this way, by the end of the voice interview the hub device has obtained at least one piece of record segment information corresponding to the various speakers.
It should be noted that a piece of record segment information may include the voice data to be interpreted, the speaker identity information, the translation information, and the collection time. Each record segment can use fields for data recording.
In some embodiments of this application, the hub device may perform text recognition on the voice data to be interpreted to obtain source text information, and record the collection time, speaker identity information, translation information, and source text information corresponding to the voice data to be interpreted until the target voiceprint information changes, obtaining one piece of record segment information; at the end of the voice interview, at least one piece of record segment information has thus been obtained.
Further, in the embodiments of this application, a piece of record segment information may also include the source text information.
It should be noted that the hub device uses ASR technology to convert the voice data to be interpreted into the source text information.
In the embodiments of this application, in a multi-person interview scenario, multiple participants generally take turns speaking, and a speaker may also be interrupted while speaking; an interruption can be regarded as a switch of the speaking participant, and record segments are delimited by speaker switches or by the end of speech. That is, for the voice data to be interpreted transmitted in real time, the hub device identifies the speaker based on voiceprint recognition; when the identified speaker changes at some moment, this indicates that one speaker has finished and the next has begun, so the process returns to S101, and the information corresponding to the previous speaker is recorded as one piece of record segment information. At the end of the interview, the hub device has thus obtained at least one piece of record segment information.
It should be noted that the at least one piece of record segment information may include record segment information for the same speaker at different times, depending on the actual recording; the embodiments of this application impose no limitation.
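The segmentation rule above, where a new record segment starts whenever the identified speaker changes, can be sketched as follows. This is an illustrative, stdlib-only sketch: the input is assumed to be a time-ordered stream of already speaker-attributed text snippets, and the frame format is hypothetical.

```python
def segment_by_speaker(frames):
    """frames: time-ordered list of (timestamp, speaker, text) tuples.
    Returns one segment per uninterrupted run of the same speaker, so an
    interruption by another speaker closes the current segment."""
    segments = []
    for ts, speaker, text in frames:
        if segments and segments[-1]["speaker"] == speaker:
            # Same speaker is still talking: extend the current segment.
            segments[-1]["text"] += " " + text
        else:
            # Speaker switched (or first frame): open a new record segment,
            # keeping the timestamp of its first frame as the collection time.
            segments.append({"speaker": speaker, "start": ts, "text": text})
    return segments
```

Note that the same speaker can legitimately appear in several segments, for example when speaker A is interrupted by B and then resumes.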
For example, before the interview starts, the voiceprint and name of each participant are first registered. After the interview starts, every sentence spoken by each person is saved, in units of complete segments. A piece of record segment information can be recorded as shown in Table 1:
Table 1
- Speaker name: the name of the person speaking (obtained by voiceprint recognition)
- Timestamp: the timestamp at which the speech starts
- Audio: the source-language audio of the speech (captured by the microphone)
- Speech text: the text corresponding to the source-language audio (obtained by ASR)
- Translated text: the speech text translated into the target language (obtained by machine translation)
Here, the speaker name is the speaker identity information, the timestamp is the collection time, the audio is the voice data to be interpreted, the speech text is the source text information, and the translated text is the translation information.
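The fields of Table 1 map naturally onto a small record type. The field names below are illustrative choices, not names mandated by the embodiment.

```python
from dataclasses import dataclass

@dataclass
class RecordSegment:
    """One piece of record segment information, following Table 1."""
    speaker_name: str     # speaker identity information, from voiceprint recognition
    timestamp: float      # collection time: timestamp at which the speech starts
    audio: bytes          # source-language audio captured by the microphone
    speech_text: str      # source text information, from ASR
    translated_text: str  # translation information in the target language
```

Accumulating one such record per speaking turn yields the "at least one piece of record segment information" from which the interview record is later generated.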
S104: Generate an interview record based on the at least one piece of record segment information.
At the end of the interview the hub device has recorded at least one piece of record segment information, so it can generate an interview record for this interview based on that information.
In the embodiments of this application, the hub device can communicate with the control terminal, and the control terminal is used to receive routine settings entered by the user, such as the target languages, the number of participants, and the listening language of each integrated terminal. In addition, the control terminal can also control functions of the hub device, for example the interview record generation function.
In some embodiments of this application, the hub device may receive an interview generation instruction sent by the control terminal; in response to the instruction, generate the interview record from the at least one piece of record segment information in timeline order; and send the interview record to the control terminal.
It should be noted that the control terminal side may be provided with an input device. After the voice interview ends, the user can generate an interview generation instruction through the input device and send it to the hub device. Since the hub device has already recorded at least one piece of record segment information about the voice interview, it generates the interview record from that information and sends the interview record to the control terminal for presentation.
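Assembling the interview record in timeline order, as S104 describes, can be sketched as follows. The segment fields follow Table 1, and the rendered line format is purely illustrative; the embodiment does not fix an output format.

```python
def generate_interview_record(segments):
    """Sort the accumulated record segments along the time axis and render
    one line per segment: [collection time] speaker: translated text."""
    ordered = sorted(segments, key=lambda s: s["timestamp"])
    lines = [
        f'[{s["timestamp"]}] {s["speaker_name"]}: {s["translated_text"]}'
        for s in ordered
    ]
    return "\n".join(lines)
```

Sorting by the stored collection time means segments can be recorded out of order (for example when several are finalized concurrently) and still produce a chronologically correct record.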
It can be understood that, since the hub device can determine the speaker identity information from the speaker's voice data to be interpreted in a voice interview scenario, and obtain translation information in the languages the listeners require, it can generate an interview record for the interview from this information at the end of the interview. While performing real-time simultaneous interpretation of the voice data, the hub device also records the determined speaker identity information, translation information, and other structured data as record segment information, and finally, when the interview ends, generates the interview record of the voice interview from the multiple pieces of record segment information. This improves the efficiency of data collation in the voice interview, that is, the generation speed and processing efficiency of the interview record.
In some embodiments of this application, as shown in FIG. 3, after S103, an embodiment of this application further provides a voice information processing method including S105-S106, as follows:
S105: Use summary extraction technology to perform summary extraction on the at least one piece of record segment information, extracting full-text summary information.
S106: Generate the interview record based on the at least one piece of record segment information and the full-text summary information.
中枢设备在访谈结束时,记录了至少一个记录片段信息,此外,中枢设备还可以采用摘要提取技术(比如TextRank算法),对至少一个记录片段信息进行摘要提取,提取出全文摘要信息,其中,全文摘要信息表征此次语音访谈中发言者们谈论的主要内容的总结,即,全文摘要信息是在汇总各个发言人的全部发言之后,提取出来的全文总结。这样中枢设备就可以基于至少一个记录片段信息和全文摘要信息,生成针对此次访谈的访谈记录了。At the end of the interview, the hub device recorded at least one piece of record information. In addition, the hub device can also use abstract extraction technology (such as the TextRank algorithm) to extract at least one piece of record information to extract the full text summary information. The summary information represents the summary of the main content discussed by the speakers in this voice interview, that is, the full-text summary information is the full-text summary extracted after summarizing all the speeches of each speaker. In this way, the central device can generate an interview record for this interview based on at least one piece of record information and full-text summary information.
In the embodiments of the present application, the full-text summary information can be placed at the beginning of the interview record, so that a user can get a general idea of the interview content and decide whether to continue reading the record.
It should be noted that in the era of information explosion, users need to obtain information quickly. When a speaker's statement is long, summary extraction can be applied to extract a core summary, improving the efficiency with which users read the interview record.
In some embodiments of the present application, as shown in FIG. 4, after recording the collection time, speaker identity information, and translation information corresponding to the voice data to be simultaneously interpreted in S103 to obtain a piece of record segment information, the method further includes S107-S109, as follows:
S107. Using a summary extraction technique, perform summary extraction on a piece of record segment information to extract speaker summary information.
S108. At the end of the voice interview, obtain at least one piece of record segment information and at least one piece of speaker summary information.
S109. Generate an interview record based on the at least one piece of record segment information and the at least one piece of speaker summary information.
In the embodiment of the present application, after the hub device records the collection time, speaker identity information, and translation information corresponding to the voice data to be simultaneously interpreted to obtain a piece of record segment information, the hub device can apply a summary extraction technique to that piece of record segment information to extract speaker summary information. That is, each time a piece of record segment information is produced, the speaker summary information for that piece can be extracted at the same time. In this way, at least one piece of record segment information and at least one piece of speaker summary information are obtained by the end of the voice interview; finally, the hub device can generate the interview record based on the at least one piece of record segment information and the at least one piece of speaker summary information.
In some embodiments of the present application, the hub device may instead perform summary extraction on each piece of record segment information after all of the record segment information has been obtained, yielding at least one piece of speaker summary information; the embodiments of the present application impose no limitation on this.
It can be appreciated that the hub device can condense the central idea of each piece of record segment information into speaker summary information, which improves the reading efficiency of each statement in the generated interview record.
In some embodiments of the present application, the interview record can also be generated based on the full-text summary information, the at least one piece of speaker summary information, and the at least one piece of record segment information together. An interview record generated this way contains not only a summary of the full text but also a summary of each statement, better surfacing the main ideas of the interview and thereby increasing the richness and diversity of the interview record.
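The structured data described above can be sketched as follows. This is one possible representation, assembling record segments, per-statement speaker summaries, and the full-text summary into an interview record; the field names are illustrative assumptions, not identifiers from this application.

```python
from dataclasses import dataclass

@dataclass
class RecordSegment:
    collect_time: str          # time at which capture of the statement began
    speaker_id: str            # determined speaker identity information
    source_text: str           # transcript of the voice data
    translation: str           # translation into the listener's target language
    speaker_summary: str = ""  # per-statement summary, if extracted

def build_interview_record(segments, full_text_summary=""):
    """Order segments by collection time and prepend the full-text summary."""
    lines = []
    if full_text_summary:
        lines.append(f"Full-text summary: {full_text_summary}")
    for seg in sorted(segments, key=lambda s: s.collect_time):
        lines.append(f"[{seg.collect_time}] {seg.speaker_id}: {seg.translation}")
        if seg.speaker_summary:
            lines.append(f"  Summary: {seg.speaker_summary}")
    return "\n".join(lines)
```

Sorting by collection time gives the timeline ordering the record is later displayed in, and the full-text summary occupies the beginning of the record as described above.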
In some embodiments of the present application, the hub device may also be provided with, or connected to, a loudspeaker. The control terminal can set the loudspeaker's playback language, and the speaker's statement can be converted into the playback language and played through the loudspeaker at the interview venue.
FIG. 5 is a first schematic flowchart of a voice information processing method provided by an embodiment of the present application. As shown in FIG. 5, applied to a control terminal, the voice information processing method includes the following steps:
S201. Receive participant identity information, target languages, and participants' voiceprint information.
S202. Send the preset mapping relationship formed by the participant identity information, the target languages, and the participants' voiceprint information to the hub device.
In the voice interview scenario, the control terminal may be a smart device such as a smartphone, tablet computer, or computer on which a designated application is installed (for example, a simultaneous interpretation application implementing the voice processing method provided by this application); the embodiments of this application impose no limitation. The control terminal can communicate with the hub device. The control terminal side can be provided with an input device through which general settings can be entered, such as the target language, the number of participants, and the listening language of each headset/microphone terminal.
In the embodiment of the present application, before the voice interview begins, a user can first enter the participants' relevant information through the control terminal, for example, each participant's identity information and target language. Meanwhile, each participant's voiceprint information is collected by recording through the integrated terminal; after the participant voiceprint information is sent to the control terminal via the hub device, the control terminal associates the participant identity information, participant voiceprint information, and target language to form the preset mapping relationship. Finally, the control terminal can send the preset mapping relationship to the hub device for use.
In some embodiments of the present application, the control terminal may instead send the participant identity information and target language to the hub device, and the integrated terminal may send the participant voiceprint information to the hub device, with the hub device associating the participant identity information, participant voiceprint information, and target language to form the preset mapping relationship. The process of obtaining the preset mapping relationship can therefore be chosen according to the actual situation, and the embodiments of this application impose no limitation.
The participant identity information may be the participant's name, a unique identifier, or the like, which is not limited by the embodiments of this application.
It should be noted that in the embodiments of this application, the preset mapping relationship is the correspondence among participant identity information, target language, and participant voiceprint information. It may comprise the correspondence between participant identity information and target language, the correspondence between participant identity information and participant voiceprint information, the relationship between target language and participant voiceprint information, a participant voiceprint information library, a participant identity information library, a target language library, and so on.
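A minimal sketch of how the preset mapping relationship described above could be represented follows. Voiceprints are stood in for by fixed-length feature vectors, and the field names are assumptions made for illustration only.

```python
# Each enrolled participant carries identity, target language, and voiceprint.
participants = [
    {"identity": "Speaker A", "target_language": "zh", "voiceprint": [0.9, 0.1, 0.0]},
    {"identity": "Speaker B", "target_language": "en", "voiceprint": [0.1, 0.8, 0.2]},
]

# Derived lookup tables, mirroring the correspondences listed in the text.
identity_to_language = {p["identity"]: p["target_language"] for p in participants}
identity_to_voiceprint = {p["identity"]: p["voiceprint"] for p in participants}

def listener_languages(speaker_identity):
    """Target languages of everyone except the current speaker (the listeners)."""
    return {p["identity"]: p["target_language"]
            for p in participants if p["identity"] != speaker_identity}
```

The `listener_languages` helper reflects the definition used throughout the text: a listener is any participant other than the current speaker, and each listener's target language decides which translation they receive.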
S203. At the end of the interview, receive an interview trigger instruction and, in response to the interview trigger instruction, generate an interview generation instruction.
S204. Send the interview generation instruction to the hub device.
S205. Receive the interview record that the hub device returns in response to the interview generation instruction, the interview record being generated by the hub device, in response to the interview generation instruction, based on the preset mapping relationship and the voice data to be simultaneously interpreted received in real time.
In the embodiment of this application, the control terminal can direct the hub device to generate the interview record. At the end of the interview, the user can trigger the interview-record generation function through the control terminal, producing an interview trigger instruction; the control terminal then generates an interview generation instruction and sends it to the hub device. While performing real-time simultaneous interpretation of the voice data to be simultaneously interpreted, the hub device also records the determined speaker identity information, translation information, and other structured data as record segment information, i.e., at least one piece of record segment information. The hub device then generates the interview record from the at least one piece of record segment information and sends the interview record to the control terminal for presentation.
The interview record is generated by the hub device in response to the interview generation instruction, based on the preset mapping relationship and the voice data to be simultaneously interpreted received in real time.
It can be appreciated that the control terminal can obtain the interview record, which captures the content of the voice interview, from the hub device. The user can thus obtain or view the interview record through the control terminal conveniently and quickly, which improves the intelligence of the control terminal.
In some embodiments of the present application, as shown in FIG. 6, after S205 the method further includes S206; or S207-S208; or S209-S211, as follows:
S206. Display the interview record in timeline order.
After the control terminal obtains the interview record, since the record captures information related to each speaker's statements in the voice interview and the statements occur in temporal order, the content of the interview record is likewise temporal. The control terminal can therefore display the interview record in timeline order.
In the embodiment of this application, where each statement corresponds to one segment of the interview record, each segment may include: speaker identity information, collection time, the voice data to be simultaneously interpreted, and translation information. The control terminal can then arrange the segments of the interview record in timeline order by collection time, and display the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged segment.
It should be noted that the voice data to be simultaneously interpreted and the translation information corresponding to each arranged segment can be laid out, via function buttons, in the area corresponding to the speaker identity information; when a function button is triggered, the corresponding content is displayed. The text transcribed from the voice data to be simultaneously interpreted, i.e., the source text information, can also be transmitted from the hub device to the control terminal, and the translation information may include translated text information and translated voice data. For ease of display, the function buttons can be set up one per content item, or several content items can be addressed by triggering two buttons jointly. For example, source text information corresponds to function button 1, translated text information to function button 2, and voice data to function button 3: when 1 and 3 are triggered together, the voice data to be simultaneously interpreted is played; when 2 and 3 are triggered together, the translated voice data is played; when 1 alone is triggered, the source text information is displayed; when 2 alone is triggered, the translated text information is displayed.
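The button combinations described above can be sketched as a small dispatch function. The button numbers come from the example in the text; the action names are illustrative assumptions.

```python
def resolve_action(pressed):
    """Map a set of pressed function buttons to a display/playback action.

    Per the example: 1 = source text, 2 = translated text, 3 = audio modifier.
    """
    pressed = set(pressed)
    if pressed == {1, 3}:
        return "play_source_audio"       # play the voice data to be interpreted
    if pressed == {2, 3}:
        return "play_translated_audio"   # play the translated voice data
    if pressed == {1}:
        return "show_source_text"        # display the source text information
    if pressed == {2}:
        return "show_translated_text"    # display the translated text information
    return "no_op"
```

Joint triggering of button 3 with either text button selects the audio form of that content, so one extra button doubles the number of reachable content items.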
In some embodiments of the present application, a comparison button can also be provided for displaying the source text information and the translated text information side by side; the embodiments of this application impose no limitation on how the buttons for the content displayed in the interview record are set up and arranged.
It should be noted that, since different listeners have different target languages, the translation information in the generated interview record will differ between listeners. A separate interview record can therefore be generated for each listener, or all statements can be translated into a common language; the embodiments of this application impose no limitation.
As an example, take English as the speaker's language and Chinese as the translation language. FIGS. 7 and 8 show a display interface for an interview record: the control terminal presents the obtained Xxx interview record, and the display interface provides four function buttons, "原" (source), "译" (translation), "音" (audio), and "对照" (compare). The speakers include Speaker A, Speaker B, and Speaker C. Area 1, following each speaker's identity information, holds the four function buttons; area 2, below each speaker's identity information, shows the collection time (for example, Speaker A: 2019.08.31 22:10:15; Speaker B: 2019.08.31 22:12:10; Speaker C: 2019.08.31 22:13:08), i.e., the time each statement began, and the segments of the interview record are displayed in order of statement time. When "译" for Speaker A is triggered, area 3 shows the translated text: "我认为，中国将在5G竞赛中遥遥领先！在未来一到两年内，5G将开始落地应用，并取得爆发式增长" ("I believe that China will lead in the 5G competition! In the next one to two years, 5G will begin to be applied and achieve explosive growth"). When "原" for Speaker B is triggered, area 4 shows the source text: "I agree with you very much, I think 5G will bring a lot of new opportunities". When "音" for Speaker C is triggered, the translated voice data is played in area 5. When "对照" for Speaker A is triggered, area 6 shows the source text "I believe that China will lead in the 5G competition! In the next one to two years, 5G will begin to apply and achieve explosive growth" side by side with the translated text "我认为，中国将在5G竞赛中遥遥领先！在未来一到两年内，5G将开始落地应用，并取得爆发式增长". As the figures show, each segment of the interview record carries a timestamp for the speaker's statement, and the source, translation, audio, and comparison options can be selected.
In some embodiments of the present application, each segment of the interview record further includes speaker summary information. After the control terminal displays the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged segment, it can also display the speaker summary information in a first preset area within each segment's display area.
It should be noted that, in this embodiment of the application, if a speaker's statement is relatively long (for example, more than 70 characters), the hub device generates a statement summary (i.e., speaker summary information) and carries it in the interview record sent to the control terminal, so that readers can browse the speaker summary information on the control terminal and use it to select which speaker's full statement to read.
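The length gate described above can be sketched as follows. The 70-character threshold comes from the example in the text; `summarize` is a stand-in for whatever summary extraction technique the hub device applies.

```python
SUMMARY_THRESHOLD = 70  # characters, per the example in the text

def maybe_summarize(statement_text, summarize):
    """Attach a speaker summary only when the statement is long enough to need one."""
    if len(statement_text) > SUMMARY_THRESHOLD:
        return summarize(statement_text)
    return ""  # short statements carry no summary
```

Short statements are left unsummarized, which avoids padding the interview record with summaries that would be as long as the statements themselves.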
It can be appreciated that the hub device can condense the central idea of each piece of record segment information into speaker summary information; the speaker summary information displayed by the control terminal then lets readers quickly grasp the main content of each segment of the interview record, improving the reading efficiency of each statement in the generated interview record.
As an example, as shown in FIG. 9, for Speaker A's statement the translated text is: "我认为，中国将在5G竞赛中遥遥领先！在未来一到两年内，5G将开始落地应用，并取得爆发式增长。在之前2/3/4G时代，中国一直处于被动状态，这让外国人认为中国没有实力率先研发出5G。万万没想到，中国成了5G时代的领跑者" (roughly: "I believe China will lead in the 5G competition and achieve explosive growth within one to two years; in the earlier 2/3/4G eras China was in a passive position, which led foreigners to think China lacked the strength to develop 5G first, yet China has become the front-runner of the 5G era"), and the extracted speaker summary is "A认为中国5G处于领先地位，并取得大规模增长" ("A believes China's 5G is in a leading position and has achieved large-scale growth"). The control terminal can then display the speaker summary in area 11 of the interview record's display interface.
It should be noted that each speaker's summary can be displayed together with the corresponding segment of the interview record, or it can be displayed in area E upon user interaction with the interview record's display interface; the specific implementation is not limited by the embodiments of this application.
In some embodiments of the present application, the interview record further includes full-text summary information. When the control terminal displays the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged segment, it can also display the full-text summary information ahead of them.
The hub device can apply a summary extraction technique (such as the TextRank algorithm) to the at least one piece of record segment information to extract full-text summary information, which summarizes the main content discussed by the speakers in the voice interview; that is, it is the full-text summary extracted after aggregating all statements of every speaker. The hub device can then generate the interview record for the interview based on the at least one piece of record segment information and the full-text summary information, and send it to the control terminal.
In the embodiments of the present application, the full-text summary information can be placed at the beginning of the interview record, so that a user can get a general idea of the interview content and decide whether to continue reading the record.
It should be noted that in the era of information explosion, users need to obtain information quickly. When a speaker's statement is long, summary extraction can be applied to extract a core summary, improving the efficiency with which users read the interview record.
As an example, as shown in FIG. 10, for the Xxx interview record, the frontmost area F of the full record, i.e., of the multi-segment interview record display, shows the full-text summary information of the voice interview: "全文摘要：5G新时代，ABC发表了重要展望。A认为中国5G遥遥领先，并将大规模商业化进而带来无限机会，BC对此表示认同" ("Full-text summary: In the new 5G era, ABC offered an important outlook. A believes China's 5G is far ahead and that large-scale commercialization will bring boundless opportunities; B and C agree").
S207. Receive an editing instruction.
S208. In response to the editing instruction, edit the interview record to obtain and display a final interview record.
After the control terminal receives the interview record, the machine-generated record may still contain errors or need manual additions and polishing. The control terminal therefore also provides an editing function: when the user turns it on, the control terminal receives an editing instruction; in the interview record's display interface it responds to the editing instruction, obtains the editing input, edits the interview record accordingly, and obtains and displays the final interview record.
It can be appreciated that the editing function refines the content of the interview record and eliminates errors, making the final interview record more accurate and complete.
S209. Receive an export instruction.
S210. In response to the export instruction, process the interview record into a preset format to obtain an export file.
S211. Share the export file.
After receiving the interview record, the control terminal may need to share it. The control terminal therefore also provides a sharing function: when the user turns it on, the control terminal receives an export instruction; in the interview record's display interface it responds to the export instruction, obtains the export format, processes the interview record into that preset format to obtain an export file, and finally shares the export file.
In the embodiments of this application, the export format may include HTML, txt, PDF, and the like; any text or web-page format is acceptable, and the embodiments of this application impose no limitation.
It should be noted that when exporting the interview record, the control terminal can export different formats for different purposes and platforms, for example exporting txt for plain text, PDF for archiving, and HTML for sharing.
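The format dispatch described above can be sketched as follows. The format names come from the text; the rendering itself is a stand-in (real PDF generation would require a dedicated library).

```python
def export_record(record_text, fmt):
    """Render the interview record text into the requested export format."""
    fmt = fmt.lower()
    if fmt == "txt":
        return record_text                            # plain text
    if fmt == "html":
        body = record_text.replace("\n", "<br>\n")
        return f"<html><body>{body}</body></html>"    # for sharing
    if fmt == "pdf":
        # Placeholder: archive-oriented output would be produced by a PDF
        # library; here we only tag the content.
        return "%PDF-placeholder\n" + record_text
    raise ValueError(f"unsupported export format: {fmt}")
```

Dispatching on the requested format keeps the record's content identical across exports, with only the presentation layer changing per purpose or platform.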
In some embodiments of the present application, the voice data related to the interview can also be stored on the hub device or in the cloud. When exporting the interview record, the control terminal can then attach a link for sharing the audio, so that both the interview record and the audio link can be shared with others, giving them access to the interview record's text and voice information.
It can be appreciated that the sharing function makes it possible to share interview records, improving the intelligence and versatility of the voice interview scenario.
It can be appreciated that the voice information processing method provided by this application solves the problem of interview record generation, greatly improves the efficiency of compiling reports afterwards, and provides convenient features such as source text, translation, comparison, audio, and summaries.
An embodiment of the present application provides a voice information processing method, as shown in FIG. 11, including:
S301. The control terminal receives participant identity information, target languages, and participants' voiceprint information.
S302. The control terminal sends the preset mapping relationship formed by the participant identity information, the target languages, and the participants' voiceprint information to the hub device.
S303. After the voice interview starts, the hub device receives from the collection terminal the speaker's voice data to be simultaneously interpreted, and obtains the collection time at which capture of the voice data began.
S304. Based on the voice data to be simultaneously interpreted and the preset mapping relationship, determine the speaker identity information, and in real time simultaneously interpret the voice data into each listener's target language to obtain translation information. The preset mapping relationship is the correspondence among participant identity information, target language, and participant voiceprint information, where a listener is any participant other than the speaker.
S305. The hub device sends the translation information to the receiving terminal in real time for the receiving terminal to play.
S306. The hub device records the collection time, speaker identity information, and translation information corresponding to the voice data to be simultaneously interpreted to obtain a piece of record segment information, thereby obtaining at least one piece of record segment information by the end of the voice interview.
S307. The hub device receives the interview generation instruction sent by the control terminal and, in response, generates the interview record from the at least one piece of record segment information in timeline order.
S308. The hub device sends the interview record to the control terminal.
S309. The control terminal displays the interview record.
As an example: the participants' names (participant identity information) and languages (target languages) are entered through the control terminal, and each participant's voiceprint information is captured by recording through the headset/microphone integrated terminal; the hub device associates the participant identity information, target languages, and participant voiceprint information to obtain the preset mapping relationship. When the interview starts, the simultaneous interpretation module in the hub device begins working: the speaker's voice data to be simultaneously interpreted is recorded through the headset/microphone integrated terminal and sent to the hub device in real time, where transcription, translation, recording, and translated-speech generation are completed according to the preset mapping relationship, yielding record segment information; the translated speech is then sent to the other participants' (listeners') headset/microphone integrated terminals, so each listener hears the corresponding translated audio. When the interview ends, under the control of the control terminal, the hub device generates the interview record from the multiple pieces of record segment information; finally, the interview record is edited through the control terminal to obtain and display the final interview record.
As shown in FIG. 12, an embodiment of the present application provides a hub device 1, which may include:
a first receiving unit 10, configured to receive, after a voice interview starts, the speaker's voice data to be simultaneously interpreted transmitted by a collection terminal, and to obtain the collection time at which collection of the voice data began;
a determining unit 11, configured to determine speaker identity information based on the voice data to be interpreted and a preset mapping relationship;
a translation unit 12, configured to interpret the voice data to be interpreted into a listener's target language in real time to obtain translation information, wherein the preset mapping relationship is a correspondence among participant identity information, target languages, and participants' voiceprint information, and a listener is any participant other than the speaker;
a recording unit 13, configured to record the collection time, the speaker identity information, and the translation information corresponding to the voice data to be interpreted to obtain one piece of record segment information, such that at least one piece of record segment information is obtained by the end of the voice interview; and
a first generating unit 14, configured to generate an interview record based on the at least one piece of record segment information.
In some embodiments of the present application, the determining unit 11 is further configured to determine, from the participants' voiceprint information in the preset mapping relationship, target voiceprint information that matches the voiceprint of the voice data to be interpreted; to determine, based on the target voiceprint information and the correspondence between participant identity information and participants' voiceprint information in the preset mapping relationship, the speaker identity information corresponding to that voiceprint; and to obtain each listener's target language based on the correspondence between participant identity information and target language in the preset mapping relationship.
The translation unit 12 is further configured to translate the voice data to be interpreted into the listener's target language in real time to obtain the translation information.
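One common way to realize the matching step performed by the determining unit 11 is nearest-neighbor search over enrolled voiceprint embeddings, for example by cosine similarity. The application does not specify a matching algorithm, so the embedding vectors and the threshold below are illustrative assumptions only.

```python
import math
from typing import Dict, List

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_speaker(frame: List[float],
                  enrolled: Dict[str, List[float]],
                  threshold: float = 0.7) -> str:
    """Return the enrolled identity whose voiceprint best matches the
    incoming audio features, or 'unknown' if nothing clears the threshold."""
    best_name, best_score = "unknown", threshold
    for name, voiceprint in enrolled.items():
        score = cosine(frame, voiceprint)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

In a real system the vectors would come from a speaker-embedding model; the dictionary lookup here plays the role of the preset mapping relationship.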
In some embodiments of the present application, the recording unit 13 is further configured to perform text recognition on the voice data to be interpreted to obtain source text information, and to record the collection time, the speaker identity information, the translation information, and the source text information corresponding to the voice data to be interpreted until the target voiceprint information changes, thereby obtaining one piece of record segment information; the at least one piece of record segment information is thus obtained by the end of the voice interview.
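The rule "record until the target voiceprint information changes" amounts to grouping consecutive recognition results by matched speaker. A sketch, under assumed `(time, speaker, text)` frame tuples and dict-shaped segments:

```python
def segment_by_voiceprint(frames):
    """Group consecutive (time, speaker, text) frames into record segments,
    closing the current segment whenever the matched speaker changes."""
    segments = []
    for time, speaker, text in frames:
        if not segments or segments[-1]["speaker"] != speaker:
            # a new speaker opens a new record segment
            segments.append({"capture_time": time, "speaker": speaker, "texts": []})
        segments[-1]["texts"].append(text)
    return segments
```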
In some embodiments of the present application, the hub device 1 further includes an extraction unit 15.
The extraction unit 15 is configured, after the at least one piece of record segment information has been obtained at the end of the voice interview, to perform summary extraction on the at least one piece of record segment information by using a summary extraction technique, thereby extracting full-text summary information.
The first generating unit 14 is further configured to generate the interview record based on the at least one piece of record segment information and the full-text summary information.
In some embodiments of the present application, the hub device 1 further includes an extraction unit 15.
The extraction unit 15 is configured, after one piece of record segment information has been obtained, to perform summary extraction on that piece by using a summary extraction technique, thereby extracting speaker summary information; at the end of the voice interview, the at least one piece of record segment information and at least one piece of speaker summary information are obtained.
In some embodiments of the present application, the first generating unit 14 is further configured to generate the interview record based on the at least one piece of record segment information and the at least one piece of speaker summary information.
In some embodiments of the present application, the hub device 1 further includes a first sending unit 16.
The first receiving unit 10 is further configured to receive an interview generation instruction sent by the control terminal.
The first generating unit 14 is further configured, in response to the interview generation instruction, to generate the interview record from the at least one piece of record segment information in chronological order.
The first sending unit 16 is configured to send the interview record to the control terminal.
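Generating the interview record "in the order of the time axis" is, at its core, a stable sort of the record segments by collection time. A sketch (the dict-shaped segments are an assumption):

```python
def generate_interview_record(segments):
    """Arrange record segments chronologically by their collection time."""
    return sorted(segments, key=lambda s: s["capture_time"])
```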
It can be understood that, in a voice interview scenario, the hub device can determine the speaker's identity from the voice data to be interpreted and obtain translation information in the languages the listeners require, and at the end of the interview an interview record can be generated based on this information. While performing real-time simultaneous interpretation of the voice data, the hub device also records the determined speaker identity information, translation information, and other structured data as record segment information; when the interview ends, the multiple pieces of record segment information are used to generate the interview record of the voice interview. This improves the efficiency of data collation in a voice interview, that is, the generation speed and processing efficiency of the interview record.
As shown in FIG. 13, an embodiment of the present application provides a control terminal 2, which may include:
a second receiving unit 20, configured to receive participant identity information, target languages, and participants' voiceprint information;
a mapping unit 21, configured to send the preset mapping relationship formed from the participant identity information, the target languages, and the participants' voiceprint information to the hub device;
the second receiving unit 20 being further configured to receive an interview trigger instruction at the end of the interview;
a second generating unit 22, configured to generate an interview generation instruction in response to the interview trigger instruction;
a second sending unit 23, configured to send the interview generation instruction to the hub device; and
the second receiving unit 20 being further configured to receive the interview record fed back by the hub device in response to the interview generation instruction, the interview record being generated by the hub device, in response to that instruction, based on the preset mapping relationship and the voice data to be interpreted that is received in real time.
In some embodiments of the present application, the control terminal 2 further includes a display unit 24.
The display unit 24 is configured, after the interview record fed back by the hub device in response to the interview generation instruction is received, to display the interview record in chronological order.
In some embodiments of the present application, each segment of the interview record includes speaker identity information, a collection time, the voice data to be interpreted, and translation information.
In some embodiments of the present application, the display unit 24 is further configured to arrange the segments of the interview record chronologically by collection time, and to display the speaker identity information, the voice data to be interpreted, and the translation information corresponding to each arranged segment.
In some embodiments of the present application, each segment of the interview record further includes speaker summary information.
The display unit 24 is further configured, after displaying the speaker identity information, the voice data to be interpreted, and the translation information corresponding to each arranged segment, to display the speaker summary information in a first preset area within each segment's display area.
In some embodiments of the present application, the interview record further includes full-text summary information.
The display unit 24 is further configured, when displaying the speaker identity information, the voice data to be interpreted, and the translation information corresponding to each arranged segment, to display the full-text summary information ahead of the arranged segments.
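The display rules above, with the full-text summary shown first and each chronologically arranged segment showing its speaker, time, source text, translation, and per-segment summary, can be sketched as a plain-text renderer. The field names are assumptions for illustration.

```python
def render_record(full_summary, segments):
    """Render the interview record: full-text summary ahead of the
    chronologically ordered segments, each with its own summary area."""
    lines = ["Summary: " + full_summary]
    for seg in sorted(segments, key=lambda s: s["capture_time"]):
        lines.append(f"[{seg['capture_time']}] {seg['speaker']}: "
                     f"{seg['source_text']} / {seg['translation']}")
        if seg.get("speaker_summary"):
            # the first preset area within this segment's display area
            lines.append("  > " + seg["speaker_summary"])
    return "\n".join(lines)
```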
In some embodiments of the present application, the control terminal 2 further includes an editing unit 25 and a display unit 24.
The second receiving unit 20 is further configured to receive an editing instruction after the interview record fed back by the hub device in response to the interview generation instruction is received.
The editing unit 25 is configured to edit the interview record in response to the editing instruction, to obtain a final interview record.
The display unit 24 is configured to display the final interview record.
In some embodiments of the present application, the control terminal 2 further includes an export unit 26 and a sharing unit 27.
The second receiving unit 20 is further configured to receive an export instruction after the interview record fed back by the hub device in response to the interview generation instruction is received.
The export unit 26 is configured, in response to the export instruction, to process the interview record into a preset format to obtain an export file.
The sharing unit 27 is configured to share the export file.
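The export step ("process the interview record into a preset format") could, for example, serialize the record to a file. The application does not name a format, so the choice of JSON below is purely an assumption.

```python
import json

def export_record(record, path):
    """Write the interview record to `path` in an assumed preset format (JSON)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2)
    return path
```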
It can be understood that the control terminal can obtain the interview record, which captures the content of the voice interview, from the hub device. A user can thus obtain or view the interview record through the control terminal conveniently and quickly, which improves the intelligence of the control terminal.
As shown in FIG. 14, an embodiment of the present application provides a hub device, including:
a first processor 17 and a first memory 18;
the first processor 17 being configured to execute a simultaneous interpretation program stored in the first memory 18, so as to implement the voice information processing method on the hub device side.
As shown in FIG. 15, an embodiment of the present application provides a control terminal, including:
a second processor 28 and a second memory 29;
the second processor 28 being configured to execute a simultaneous interpretation program stored in the second memory 29, so as to implement the voice information processing method on the control terminal side.
In the embodiments of the present disclosure, the first processor 17 or the second processor 28 may be at least one of an Application-Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field-Programmable Gate Array (FPGA), a CPU, a controller, a microcontroller, or a microprocessor. It can be understood that, for different devices, other electronic components may be used to implement the functions of the first processor 17 or the second processor 28, which is not limited in the embodiments of the present disclosure. The hub device further includes the first memory 18, which may be connected to the first processor 17, and the control terminal further includes the second memory 29, which may be connected to the second processor 28. The first memory 18 or the second memory 29 may include a high-speed RAM, and may further include a non-volatile memory, for example, at least two disk memories.
In practical applications, the first memory 18 or the second memory 29 may be a volatile memory, such as a Random-Access Memory (RAM); or a non-volatile memory, such as a Read-Only Memory (ROM), a flash memory, a Hard Disk Drive (HDD), or a Solid-State Drive (SSD); or a combination of the above types of memory, and provides instructions and data to the first processor 17 or the second processor 28.
In addition, the functional modules in this embodiment may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional module.
If the integrated unit is implemented as a software functional module and is not sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this embodiment, in essence the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or some of the steps of the method of this embodiment. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random-access memory, a magnetic disk, or an optical disc.
Accordingly, an embodiment of the present application further provides a computer-readable storage medium on which a simultaneous interpretation program is stored; when the program is executed by one or more first processors, the voice information processing method on the hub device side is implemented.
An embodiment of the present application further provides a computer-readable storage medium on which a simultaneous interpretation program is stored; when the program is executed by one or more second processors, the voice information processing method on the control terminal side is implemented.
The computer-readable storage medium may be a volatile memory, such as a RAM, or a non-volatile memory, such as a ROM, a flash memory, an HDD, or an SSD; it may also be a device including one or any combination of the above memories, such as a mobile phone, a computer, a tablet device, or a personal digital assistant.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above is only a specific implementation of the present application, and the protection scope of the present application is not limited thereto. Any variation readily conceived by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application.
Industrial applicability
In the voice information processing method provided by the embodiments of the present application, the hub device can, in a voice interview scenario, determine speaker identity information from the speaker's voice data to be interpreted and obtain translation information in the languages the listeners require; at the end of the interview, an interview record can be generated based on this information. While performing real-time simultaneous interpretation of the voice data, the hub device also records the determined speaker identity information, translation information, and other structured data as record segment information, and when the interview ends it generates the interview record of the voice interview from the multiple pieces of record segment information. This improves the efficiency of data collation in a voice interview, that is, the generation speed and processing efficiency of the interview record.

Claims (20)

  1. A voice information processing method, comprising:
    after a voice interview starts, receiving a speaker's voice data to be simultaneously interpreted transmitted by a collection terminal, and obtaining a collection time at which collection of the voice data began;
    determining speaker identity information based on the voice data to be interpreted and a preset mapping relationship, and interpreting the voice data to be interpreted into a listener's target language in real time to obtain translation information, wherein the preset mapping relationship is a correspondence among participant identity information, target languages, and participants' voiceprint information, and the listener is a participant other than the speaker;
    recording the collection time, the speaker identity information, and the translation information corresponding to the voice data to be interpreted to obtain one piece of record segment information, and thereby obtaining at least one piece of record segment information at the end of the voice interview; and
    generating an interview record based on the at least one piece of record segment information.
  2. The method according to claim 1, wherein determining the speaker identity information based on the voice data to be interpreted and the preset mapping relationship, and interpreting the voice data to be interpreted into the listener's target language in real time to obtain the translation information, comprises:
    determining, from the participants' voiceprint information in the preset mapping relationship, target voiceprint information that matches the voiceprint of the voice data to be interpreted;
    determining, based on the target voiceprint information and the correspondence between participant identity information and participants' voiceprint information in the preset mapping relationship, the speaker identity information corresponding to the voiceprint of the voice data to be interpreted, and obtaining the listener's target language based on the correspondence between participant identity information and target language in the preset mapping relationship; and
    translating the voice data to be interpreted into the listener's target language in real time to obtain the translation information.
  3. The method according to claim 1, wherein recording the collection time, the speaker identity information, and the translation information corresponding to the voice data to be interpreted to obtain one piece of record segment information, and thereby obtaining at least one piece of record segment information at the end of the voice interview, comprises:
    performing text recognition on the voice data to be interpreted to obtain source text information; and
    recording the collection time, the speaker identity information, the translation information, and the source text information corresponding to the voice data to be interpreted until the target voiceprint information changes, to obtain one piece of record segment information, and thereby obtaining the at least one piece of record segment information at the end of the voice interview.
  4. The method according to claim 1, wherein after recording the collection time, the speaker identity information, and the translation information corresponding to the voice data to be interpreted to obtain one piece of record segment information, and thereby obtaining at least one piece of record segment information at the end of the voice interview, the method further comprises:
    performing summary extraction on the at least one piece of record segment information by using a summary extraction technique, to extract full-text summary information;
    wherein generating the interview record based on the at least one piece of record segment information comprises:
    generating the interview record based on the at least one piece of record segment information and the full-text summary information.
  5. The method according to claim 1 or 4, wherein after recording the collection time, the speaker identity information, and the translation information corresponding to the voice data to be interpreted to obtain one piece of record segment information, the method further comprises:
    performing summary extraction on the piece of record segment information by using a summary extraction technique, to extract speaker summary information; and
    obtaining, at the end of the voice interview, the at least one piece of record segment information and at least one piece of speaker summary information.
  6. The method according to claim 5, wherein generating the interview record based on the at least one piece of record segment information comprises:
    generating the interview record based on the at least one piece of record segment information and the at least one piece of speaker summary information.
  7. The method according to claim 1, wherein the generating an interview record based on the at least one piece of record segment information comprises:
    receiving an interview generation instruction sent by a control terminal;
    in response to the interview generation instruction, generating the interview record from the at least one piece of record segment information in chronological order along a time axis;
    after the generating an interview record based on the at least one piece of record segment information, the method further comprises:
    sending the interview record to the control terminal.
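Claim 7's time-axis ordering can be illustrated with a short sketch. This is not part of the claims or specification; names such as `RecordSegment` and `generate_interview_record` are hypothetical, and the record-segment fields follow the collection time, speaker identity and translation information recited in claim 1:

```python
from dataclasses import dataclass

@dataclass
class RecordSegment:
    collection_time: float   # seconds since the voice interview started
    speaker_identity: str
    translation_info: str

def generate_interview_record(segments):
    """Arrange record segments along the time axis and join them into one interview record."""
    ordered = sorted(segments, key=lambda s: s.collection_time)
    return [
        f"[{s.collection_time:.0f}s] {s.speaker_identity}: {s.translation_info}"
        for s in ordered
    ]

segments = [
    RecordSegment(30.0, "Speaker B", "Thank you for the question."),
    RecordSegment(5.0, "Speaker A", "Welcome to the interview."),
]
record = generate_interview_record(segments)
# Speaker A's segment comes first because its collection time is earlier
```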
  8. A voice information processing method, comprising:
    receiving participant identity information, target languages and voiceprint information of participants;
    sending a preset mapping relationship formed by the participant identity information, the target languages and the voiceprint information of the participants to a hub device;
    receiving an interview trigger instruction at the end of an interview, and generating an interview generation instruction in response to the interview trigger instruction;
    sending the interview generation instruction to the hub device;
    receiving an interview record fed back by the hub device in response to the interview generation instruction, wherein the interview record is generated by the hub device, in response to the interview generation instruction, based on the preset mapping relationship and voice data to be simultaneously interpreted that is received in real time.
  9. The method according to claim 8, wherein after the receiving an interview record fed back by the hub device in response to the interview generation instruction, the method further comprises:
    displaying the interview record in chronological order along a time axis.
  10. The method according to claim 9, wherein
    each interview record segment in the interview record comprises: speaker identity information, a collection time, voice data to be simultaneously interpreted, and translation information.
  11. The method according to claim 10, wherein the displaying the interview record in chronological order along a time axis comprises:
    arranging the interview record segments in the interview record in chronological order according to their collection times;
    displaying the speaker identity information, the voice data to be simultaneously interpreted and the translation information corresponding to each arranged interview record segment.
  12. The method according to claim 11, wherein each interview record segment in the interview record further comprises speaker summary information; and after the displaying the speaker identity information, the voice data to be simultaneously interpreted and the translation information corresponding to each arranged interview record segment, the method further comprises:
    displaying the speaker summary information in a first preset area within the display area of each interview record segment.
  13. The method according to claim 11, wherein the interview record further comprises full-text summary information; and when the speaker identity information, the voice data to be simultaneously interpreted and the translation information corresponding to each arranged interview record segment are displayed, the method further comprises:
    displaying the full-text summary information before the speaker identity information, the voice data to be simultaneously interpreted and the translation information corresponding to the arranged interview record segments.
  14. The method according to claim 8, wherein after the receiving an interview record fed back by the hub device in response to the interview generation instruction, the method further comprises:
    receiving an editing instruction;
    editing the interview record in response to the editing instruction, to obtain and display a final interview record.
  15. The method according to claim 8, wherein after the receiving an interview record fed back by the hub device in response to the interview generation instruction, the method further comprises:
    receiving an export instruction;
    processing the interview record into a preset format in response to the export instruction, to obtain an export file;
    sharing the export file.
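The "preset format" processing of claim 15 can be sketched as a minimal export helper. This is illustrative only and not part of the specification; the function name `export_interview_record`, the format names and the record structure are all assumptions:

```python
import json

def export_interview_record(record, fmt="json"):
    """Process an interview record into a preset format and return export-file bytes."""
    if fmt == "json":
        # ensure_ascii=False keeps any non-ASCII translation text readable
        return json.dumps(record, ensure_ascii=False, indent=2).encode("utf-8")
    if fmt == "txt":
        lines = [f"{seg['speaker']}: {seg['translation']}" for seg in record]
        return "\n".join(lines).encode("utf-8")
    raise ValueError(f"unsupported preset format: {fmt}")

record = [{"speaker": "Speaker A", "translation": "Welcome to the interview."}]
data = export_interview_record(record, fmt="txt")
```

The returned bytes could then be written to a file and shared through whatever channel the control terminal supports.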
  16. A hub device, comprising:
    a first receiving unit, configured to receive, after a voice interview starts, voice data to be simultaneously interpreted of a speaker transmitted by a collection terminal, and obtain a collection time at which collection of the voice data to be simultaneously interpreted starts;
    a determining unit, configured to determine speaker identity information based on the voice data to be simultaneously interpreted and a preset mapping relationship;
    a translation unit, configured to simultaneously interpret, in real time, the voice data to be simultaneously interpreted into a target language of a listener, to obtain translation information, wherein the preset mapping relationship is a correspondence among participant identity information, target languages and voiceprint information of participants, and the listener is a person among the participants other than the speaker;
    a recording unit, configured to record the collection time, the speaker identity information and the translation information corresponding to the voice data to be simultaneously interpreted, to obtain a piece of record segment information, and thereby obtain at least one piece of record segment information at the end of the voice interview;
    a first generating unit, configured to generate an interview record based on the at least one piece of record segment information.
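The determining unit's use of the preset mapping relationship can be sketched as a nearest-voiceprint lookup. This is a toy sketch, not the claimed implementation: the similarity measure is a simple negative squared distance (a real hub device would use a trained voiceprint model), and the mapping structure and the name `match_speaker` are assumptions:

```python
def match_speaker(voiceprint, preset_mapping):
    """Return (identity, target_language) of the participant whose stored
    voiceprint best matches the incoming one.

    preset_mapping: {identity: {"voiceprint": [floats], "target_language": str}}
    """
    def score(a, b):
        # Higher is better: negative squared Euclidean distance
        return -sum((x - y) ** 2 for x, y in zip(a, b))

    identity, info = max(
        preset_mapping.items(),
        key=lambda kv: score(voiceprint, kv[1]["voiceprint"]),
    )
    return identity, info["target_language"]

mapping = {
    "Alice": {"voiceprint": [0.1, 0.9], "target_language": "en"},
    "Bob":   {"voiceprint": [0.8, 0.2], "target_language": "zh"},
}
speaker, lang = match_speaker([0.75, 0.25], mapping)
# The incoming voiceprint is closest to Bob's stored one
```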
  17. A control terminal, comprising:
    a second receiving unit, configured to receive participant identity information, target languages and voiceprint information of participants;
    a mapping unit, configured to send a preset mapping relationship formed by the participant identity information, the target languages and the voiceprint information of the participants to a hub device;
    the second receiving unit being further configured to receive an interview trigger instruction at the end of an interview;
    a second generating unit, configured to generate an interview generation instruction in response to the interview trigger instruction;
    a second sending unit, configured to send the interview generation instruction to the hub device;
    the second receiving unit being further configured to receive an interview record fed back by the hub device in response to the interview generation instruction, wherein the interview record is generated by the hub device, in response to the interview generation instruction, based on the preset mapping relationship and voice data to be simultaneously interpreted that is received in real time.
  18. A hub device, comprising:
    a first processor and a first memory;
    wherein the first processor is configured to execute a simultaneous interpretation program stored in the first memory, to implement the voice information processing method according to any one of claims 1 to 7.
  19. A control terminal, comprising:
    a second processor and a second memory;
    wherein the second processor is configured to execute a simultaneous interpretation program stored in the second memory, to implement the voice information processing method according to any one of claims 8 to 15.
  20. A storage medium, on which a simultaneous interpretation program is stored, wherein when the simultaneous interpretation program is executed by a first processor, the voice information processing method according to any one of claims 1 to 7 is implemented; or, when the simultaneous interpretation program is executed by a second processor, the voice information processing method according to any one of claims 8 to 15 is implemented.
PCT/CN2019/130075 2019-12-30 2019-12-30 Voice information processing method, hub device, control terminal and storage medium WO2021134284A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980101053.3A CN114503117A (en) 2019-12-30 2019-12-30 Voice information processing method, center device, control terminal and storage medium
PCT/CN2019/130075 WO2021134284A1 (en) 2019-12-30 2019-12-30 Voice information processing method, hub device, control terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/130075 WO2021134284A1 (en) 2019-12-30 2019-12-30 Voice information processing method, hub device, control terminal and storage medium

Publications (1)

Publication Number Publication Date
WO2021134284A1 true WO2021134284A1 (en) 2021-07-08

Family

ID=76686162

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/130075 WO2021134284A1 (en) 2019-12-30 2019-12-30 Voice information processing method, hub device, control terminal and storage medium

Country Status (2)

Country Link
CN (1) CN114503117A (en)
WO (1) WO2021134284A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116383432A (en) * 2023-04-20 2023-07-04 中关村科学城城市大脑股份有限公司 Audio data screening method and system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001014314A (en) * 1999-06-28 2001-01-19 Sony Corp Simultaneous translation system
US20170075882A1 (en) * 2006-10-26 2017-03-16 Facebook, Inc. Simultaneous Translation of Open Domain Lectures and Speeches
CN108305632A (en) * 2018-02-02 2018-07-20 深圳市鹰硕技术有限公司 A kind of the voice abstract forming method and system of meeting
CN108766414A (en) * 2018-06-29 2018-11-06 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for voiced translation
CN108922538A (en) * 2018-05-29 2018-11-30 平安科技(深圳)有限公司 Conferencing information recording method, device, computer equipment and storage medium
CN109741754A (en) * 2018-12-10 2019-05-10 上海思创华信信息技术有限公司 A kind of conference voice recognition methods and system, storage medium and terminal
CN110491385A (en) * 2019-07-24 2019-11-22 深圳市合言信息科技有限公司 Simultaneous interpretation method, apparatus, electronic device and computer readable storage medium
CN110516265A (en) * 2019-08-31 2019-11-29 青岛谷力互联科技有限公司 A kind of single identification real-time translation system based on intelligent sound

Also Published As

Publication number Publication date
CN114503117A (en) 2022-05-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19958579

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 01.12.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19958579

Country of ref document: EP

Kind code of ref document: A1