CN114503117A - Voice information processing method, center device, control terminal and storage medium - Google Patents


Info

Publication number
CN114503117A
Authority
CN
China
Prior art keywords: information, interview, speaker, voice data, record
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980101053.3A
Other languages
Chinese (zh)
Inventor
郝杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd, Shenzhen Huantai Technology Co Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Publication of CN114503117A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/02 - Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice information processing method, a center device (1), a control terminal (2) and a storage medium. The method comprises: after a voice interview starts, receiving a speaker's voice data to be simultaneously transmitted, sent by a collection terminal, and obtaining the acquisition time at which collection of that voice data began (S101); determining the speaker's identity information based on the voice data to be simultaneously transmitted and a preset mapping relationship, and simultaneously interpreting the voice data into each listener's target language in real time to obtain translation information, where the preset mapping relationship is the correspondence among participant identity information, target languages and participant voiceprint information, and a listener is any participant other than the speaker (S102); recording the acquisition time, speaker identity information and translation information corresponding to the voice data to obtain one piece of record segment information, and thereby accumulating at least one piece of record segment information by the time the voice interview ends (S103); and generating an interview record based on the at least one piece of record segment information (S104).

Description

Voice information processing method, center device, control terminal and storage medium
Technical Field
The embodiment of the application relates to the technical field of voice processing, in particular to a voice information processing method, a center device, a control terminal and a storage medium.
Background
With increasing economic globalization, communication between different countries and cultures has become ever more frequent.
In a multi-person interview or conference, participants may come from different countries and regions, so there can be obstacles to communicating with one another. In addition, after an interview ends, translating and organizing the interview record consumes considerable manpower, the efficiency is low, and the interview content is difficult to publish and disseminate quickly.
Disclosure of Invention
The embodiments of the present application aim to provide a voice information processing method, a hub device, a control terminal and a storage medium, which can generate an interview record and improve the generation speed and processing efficiency of the interview record of a voice interview.
The technical solutions of the embodiments of the present application are realized as follows.
An embodiment of the present application provides a voice information processing method, comprising:
after a voice interview starts, receiving a speaker's voice data to be simultaneously transmitted, sent by a collection terminal, and obtaining the acquisition time at which collection of the voice data began;
determining the speaker's identity information based on the voice data to be simultaneously transmitted and a preset mapping relationship, and simultaneously interpreting the voice data into each listener's target language in real time to obtain translation information, where the preset mapping relationship is the correspondence among participant identity information, target languages and participant voiceprint information, and a listener is any participant other than the speaker;
recording the acquisition time, speaker identity information and translation information corresponding to the voice data to be simultaneously transmitted to obtain one piece of record segment information, and thereby obtaining at least one piece of record segment information when the voice interview ends;
and generating an interview record based on the at least one piece of record segment information.
In the above solution, determining the speaker identity information based on the voice data to be simultaneously transmitted and the preset mapping relationship, and interpreting the voice data into the listeners' target languages in real time to obtain the translation information, includes:
determining, from the participants' voiceprint information in the preset mapping relationship, the target voiceprint information whose voiceprint matches the voice data to be simultaneously transmitted;
determining the speaker identity information corresponding to that voiceprint based on the target voiceprint information and the correspondence between participant identity information and participant voiceprint information in the preset mapping relationship, and obtaining each listener's target language based on the correspondence between participant identity information and target language in the preset mapping relationship;
and translating the voice data to be simultaneously transmitted into each listener's target language in real time to obtain the translation information.
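The matching step described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the dictionary layout of the preset mapping relationship, the fixed-length voiceprint embeddings, and cosine-similarity matching are all assumptions, since the scheme does not prescribe a particular voiceprint representation or matching algorithm.

```python
import math

# Hypothetical participant records forming the "preset mapping relationship":
# identity -> (target language, voiceprint embedding).
PRESET_MAPPING = {
    "alice": {"target_language": "en", "voiceprint": [0.9, 0.1, 0.2]},
    "bob":   {"target_language": "zh", "voiceprint": [0.1, 0.8, 0.3]},
    "carol": {"target_language": "fr", "voiceprint": [0.2, 0.2, 0.9]},
}

def cosine(a, b):
    # Cosine similarity between two equal-length embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify_speaker(voiceprint, mapping=PRESET_MAPPING):
    """Match an utterance's voiceprint against the stored voiceprints,
    returning (speaker_id, {listener_id: target_language})."""
    speaker = max(mapping,
                  key=lambda pid: cosine(voiceprint, mapping[pid]["voiceprint"]))
    listener_langs = {pid: rec["target_language"]
                      for pid, rec in mapping.items() if pid != speaker}
    return speaker, listener_langs
```

In this sketch, identifying the speaker simultaneously yields the listener set, since every unmatched participant is a listener.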
In the above solution, recording the acquisition time, speaker identity information and translation information corresponding to the voice data to be simultaneously transmitted to obtain one piece of record segment information, and thereby obtaining at least one piece of record segment information when the voice interview ends, includes:
performing text recognition on the voice data to be simultaneously transmitted to obtain source text information;
recording the acquisition time, speaker identity information, translation information and source text information corresponding to the voice data until the target voiceprint information changes, to obtain one piece of record segment information, and thereby obtaining the at least one piece of record segment information when the voice interview ends.
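The segmentation rule above (close a record segment whenever the matched voiceprint, i.e. the speaker, changes) can be sketched as follows; the field names and the flat utterance list are hypothetical, chosen only for illustration.

```python
def segment_by_speaker(utterances):
    """Group consecutive utterances into record segments, starting a new
    segment whenever the matched speaker changes. Each utterance is a dict
    with 'time', 'speaker', 'text' and 'translation' keys (illustrative)."""
    segments = []
    current = None
    for u in utterances:
        if current is None or u["speaker"] != current["speaker"]:
            # Speaker (voiceprint) changed: close the old segment, open a new one.
            current = {"speaker": u["speaker"], "start": u["time"],
                       "source_text": [], "translation": []}
            segments.append(current)
        current["source_text"].append(u["text"])
        current["translation"].append(u["translation"])
    return segments
```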
In the above solution, after recording the acquisition time, speaker identity information and translation information corresponding to the voice data to be simultaneously transmitted to obtain one piece of record segment information, and obtaining at least one piece of record segment information when the voice interview ends, the method further includes:
performing summary extraction on the at least one piece of record segment information using a summarization technique to extract full-text summary information.
Generating the interview record based on the at least one piece of record segment information then includes:
generating the interview record based on the at least one piece of record segment information and the full-text summary information.
In the above solution, after recording the acquisition time, speaker identity information and translation information corresponding to the voice data to be simultaneously transmitted to obtain one piece of record segment information, the method further includes:
performing summary extraction on that piece of record segment information using a summarization technique to extract speaker summary information;
and obtaining the at least one piece of record segment information and at least one piece of speaker summary information when the voice interview ends.
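As a stand-in for the unspecified summarization technique, the sketch below scores sentences by the frequency of the words they contain and keeps the top-scoring ones in their original order; a real deployment would likely use a far more capable summarizer.

```python
import re
from collections import Counter

def extract_summary(sentences, max_sentences=1):
    """Naive extractive summary: score each sentence by summed word
    frequencies and return the top-scoring sentences in original order."""
    words = [re.findall(r"\w+", s.lower()) for s in sentences]
    freq = Counter(w for ws in words for w in ws)
    scored = sorted(range(len(sentences)),
                    key=lambda i: sum(freq[w] for w in words[i]),
                    reverse=True)
    keep = sorted(scored[:max_sentences])  # restore original order
    return [sentences[i] for i in keep]
```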
In the foregoing solution, generating the interview record based on the at least one piece of record segment information includes:
generating the interview record based on the at least one piece of record segment information and the at least one piece of speaker summary information.
In the foregoing solution, generating the interview record based on the at least one piece of record segment information includes:
receiving an interview generation instruction sent by the control terminal;
and generating, in response to the interview generation instruction, the interview record from the at least one piece of record segment information in timeline order.
After generating the interview record based on the at least one piece of record segment information, the method further comprises:
sending the interview record to the control terminal.
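The instruction-driven assembly of the record in timeline order might look like the following sketch; the plain-text layout and the field names are illustrative assumptions, not the scheme's actual format.

```python
def generate_interview_record(segments):
    """On receiving an interview generation instruction, arrange the
    accumulated record segments along the timeline and render one line
    per segment: [acquisition time] speaker: translation."""
    ordered = sorted(segments, key=lambda s: s["start"])
    return "\n".join(f'[{s["start"]}] {s["speaker"]}: {s["translation"]}'
                     for s in ordered)
```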
An embodiment of the present application further provides a voice information processing method, comprising:
receiving participant identity information, target languages and participant voiceprint information;
sending the preset mapping relationship formed from the participant identity information, target languages and participant voiceprint information to the hub device;
when the interview ends, receiving an interview trigger instruction and generating an interview generation instruction in response to it;
sending the interview generation instruction to the hub device;
and receiving the interview record fed back by the hub device for the interview generation instruction, where the interview record is generated by the hub device, in response to the interview generation instruction, based on the preset mapping relationship and the voice data to be simultaneously transmitted received in real time.
In the above solution, after receiving the interview record fed back by the hub device for the interview generation instruction, the method further includes:
displaying the interview record in timeline order.
In the above solution, each segment of the interview record includes: speaker identity information, acquisition time, voice data to be simultaneously transmitted and translation information.
In the above solution, displaying the interview record in timeline order includes:
arranging each segment of the interview record by acquisition time;
and displaying the speaker identity information, voice data to be simultaneously transmitted and translation information corresponding to each arranged segment.
In the above solution, each segment of the interview record further includes speaker summary information; after displaying the speaker identity information, voice data to be simultaneously transmitted and translation information corresponding to each arranged segment, the method further includes:
displaying the speaker summary information in a first preset area of each segment's display area.
In the above solution, the interview record further includes full-text summary information; when displaying the speaker identity information, voice data to be simultaneously transmitted and translation information corresponding to each arranged segment, the method further includes:
displaying the full-text summary information ahead of the speaker identity information, voice data to be simultaneously transmitted and translation information of the arranged segments.
In the above solution, after receiving the interview record fed back by the hub device for the interview generation instruction, the method further includes:
receiving an editing instruction;
and editing, in response to the editing instruction, the interview record to obtain and display a final interview record.
In the above solution, after receiving the interview record fed back by the hub device for the interview generation instruction, the method further includes:
receiving an export instruction;
processing, in response to the export instruction, the interview record into a preset format to obtain an export file;
and sharing the export file.
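A minimal sketch of the export step, assuming JSON as the "preset format" (the scheme leaves the format open, so a PDF or Word export would be equally plausible):

```python
import json

def export_record(record_segments, path):
    """Process the interview record into a preset format (JSON here, as an
    illustrative choice) and write the export file to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record_segments, f, ensure_ascii=False, indent=2)
    return path  # the export file, ready to be shared
```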
An embodiment of the present application provides a hub device, comprising:
a first receiving unit, configured to receive, after a voice interview starts, a speaker's voice data to be simultaneously transmitted sent by the collection terminal, and to obtain the acquisition time at which collection of the voice data began;
a determining unit, configured to determine the speaker identity information based on the voice data to be simultaneously transmitted and a preset mapping relationship;
a translation unit, configured to interpret the voice data to be simultaneously transmitted into each listener's target language in real time to obtain translation information, where the preset mapping relationship is the correspondence among participant identity information, target languages and participant voiceprint information, and a listener is any participant other than the speaker;
a recording unit, configured to record the acquisition time, speaker identity information and translation information corresponding to the voice data to be simultaneously transmitted to obtain one piece of record segment information, and thereby obtain at least one piece of record segment information when the voice interview ends;
and a first generating unit, configured to generate an interview record based on the at least one piece of record segment information.
An embodiment of the present application provides a control terminal, comprising:
a second receiving unit, configured to receive participant identity information, target languages and participant voiceprint information;
a mapping unit, configured to send the preset mapping relationship formed from the participant identity information, target languages and participant voiceprint information to the hub device;
the second receiving unit being further configured to receive an interview trigger instruction when the interview ends;
a second generating unit, configured to generate an interview generation instruction in response to the interview trigger instruction;
a second sending unit, configured to send the interview generation instruction to the hub device;
the second receiving unit being further configured to receive the interview record fed back by the hub device for the interview generation instruction, where the interview record is generated by the hub device, in response to the interview generation instruction, based on the preset mapping relationship and the voice data to be simultaneously transmitted received in real time.
An embodiment of the present application further provides a hub device, comprising:
a first processor and a first memory;
the first processor being configured to execute a simultaneous interpretation program stored in the first memory to implement the hub-device-side voice information processing method.
An embodiment of the present application further provides a control terminal, comprising:
a second processor and a second memory;
the second processor being configured to execute a simultaneous interpretation program stored in the second memory to implement the control-terminal-side voice information processing method.
An embodiment of the present application provides a storage medium on which a simultaneous interpretation program is stored; when executed by the first processor, the simultaneous interpretation program implements the hub-device-side voice information processing method, or, when executed by the second processor, it implements the control-terminal-side voice information processing method.
An embodiment of the present application provides a voice information processing method, a hub device, a control terminal and a storage medium. The method includes: after a voice interview starts, receiving a speaker's voice data to be simultaneously transmitted, sent by the collection terminal, and obtaining the acquisition time at which collection of the voice data began; determining the speaker's identity information based on the voice data to be simultaneously transmitted and a preset mapping relationship, and interpreting the voice data into each listener's target language in real time to obtain translation information, where the preset mapping relationship is the correspondence among participant identity information, target languages and participant voiceprint information, and a listener is any participant other than the speaker; recording the acquisition time, speaker identity information and translation information corresponding to the voice data to obtain one piece of record segment information, and thereby obtaining at least one piece of record segment information when the voice interview ends; and generating an interview record based on the at least one piece of record segment information.
With this technical solution, for a speaker's voice data in a voice interview scenario, the hub device can determine the speaker's identity information and obtain translation information in the language each listener requires, and when the interview ends it can generate an interview record from this information. In other words, while performing real-time simultaneous interpretation of the voice data, the hub device records the determined structured data, such as speaker identity information and translation information, as record segment information, and finally generates the interview record of the voice interview from the multiple pieces of record segment information obtained when the interview ends. This improves the efficiency of organizing data in a voice interview, i.e. the generation speed and processing efficiency of the interview record.
Drawings
FIG. 1 is a block diagram of a voice information processing system according to an embodiment of the present application;
fig. 2 is a first flowchart of a voice information processing method according to an embodiment of the present application;
fig. 3 is a second flowchart of a voice information processing method according to an embodiment of the present application;
fig. 4 is a third flowchart of a voice information processing method according to an embodiment of the present application;
fig. 5 is a first flowchart of a voice information processing method according to a further embodiment of the present application;
fig. 6 is a second flowchart of a voice information processing method according to a further embodiment of the present application;
fig. 7 is a first schematic view of an exemplary presentation interface of an interview record according to an embodiment of the present application;
fig. 8 is a second schematic view of an exemplary presentation interface of an interview record according to an embodiment of the present application;
fig. 9 is a third schematic view of an exemplary presentation interface of an interview record according to an embodiment of the present application;
fig. 10 is a fourth schematic view of an exemplary presentation interface of an interview record according to an embodiment of the present application;
fig. 11 is an interaction diagram of a voice information processing method according to an embodiment of the present application;
FIG. 12 is a first schematic structural diagram of a hub device according to an embodiment of the present application;
FIG. 13 is a second schematic structural diagram of a hub device according to an embodiment of the present application;
fig. 14 is a first schematic structural diagram of a control terminal according to an embodiment of the present application;
fig. 15 is a second schematic structural diagram of a control terminal according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings. It is to be understood that the specific embodiments described herein merely illustrate the application and do not limit it. It should be noted that, for convenience of description, only the parts relevant to the application are shown in the drawings.
An embodiment of the present application provides a voice information processing method implemented by a voice information processing apparatus; the apparatus may comprise a hub device, a control terminal and transceiver-integrated terminals (serving as collection terminals and receiving terminals).
FIG. 1 is a schematic diagram of a voice information processing system to which the voice information processing method is applied. As shown in fig. 1, the voice information processing system may include: a hub device 1, a control terminal 2, and a plurality of transceiver-integrated terminals 3 (comprising collection terminals 3-1 and receiving terminals 3-2).
In a multi-person interview, multi-person conference or similar scenario, a speaker gives a presentation through the transceiver-integrated terminal 3 worn by the speaker (i.e. the collection terminal 3-1). During the presentation, the collection terminal 3-1 collects the speaker's voice data (i.e. the voice data to be simultaneously transmitted) and sends it to the hub device 1 in real time. When the hub device 1 receives the voice data, it takes the time at which it started receiving that data as the acquisition time, then determines the identity information of the speaker who is speaking based on the voice data and the preset mapping relationship, and at the same time interprets the voice data into each listener's target language in real time to obtain translation information (specifically, sending each listener a translation result in the language that listener requires). The translation information is delivered in real time to the integrated terminals of the listeners attending the conference (i.e. the receiving terminals 3-2), and the voice data, acquisition time, speaker identity information, translation information and so on for each speaker are recorded as that speaker's record segment information. In addition, when the conference ends, the hub device 1 can receive an interview generation instruction sent by the control terminal 2, generate an interview record from all the record segment information obtained during the conference according to that instruction, and finally send the interview record to the control terminal 2, so that the control terminal 2 can display it, or share it with the user terminals of the conference participants, who can then view or browse the conference record.
Fig. 2 is a first flowchart illustrating a voice information processing method according to an embodiment of the present application. As shown in fig. 2, the voice information processing method applied to the center device includes the steps of:
S101, after a voice interview starts, receiving a speaker's voice data to be simultaneously transmitted, sent by a collection terminal, and obtaining the acquisition time at which collection of the voice data began.
The voice information processing method provided by the embodiments of the present application can be applied to international conferences, international interviews or any scenario requiring simultaneous interpretation; the embodiments of the present application are not limited in this respect.
It should be noted that the application scenarios may also be divided into large international conferences, small working meetings, public service venues, public social venues, social applications, general scenarios, and so on. A public service venue may be a waiting hall, a government service hall, etc., and a public social venue may be a café, a concert hall, etc. The actual application scenario corresponding to the voice data to be simultaneously transmitted is the scenario in which that voice data is collected. The embodiments of the present application do not limit the specific application scenario.
In the embodiments of the present application, the hub device communicates with the integrated terminals and the control terminal. An integrated terminal is a transceiver-integrated terminal worn by a participant in the voice interview, for example an earphone/microphone combination; the integrated terminal worn by the current speaker may also be called the collection terminal, and the integrated terminals worn by the other listeners may be called receiving terminals.
The communication method may be a wireless communication technology, a wired communication technology, a near-field communication technology or the like, for example Bluetooth or Wi-Fi; the embodiments of the present application are not limited in this respect.
It should be noted that an integrated terminal has both an earphone and a microphone: the microphone collects the voice data to be simultaneously transmitted while its wearer is speaking, and the earphone plays the translation information while its wearer is listening. Each participant can therefore be either a speaker or a listener, depending on the actual situation; the embodiments of the present application are not limited in this respect.
After the voice interview starts, the speaker's collection terminal collects the voice data to be simultaneously transmitted and sends it to the hub device in real time, and the hub device records the time at which it starts receiving that data, i.e. the acquisition time.
It should be noted that each speaker's voice data is transmitted in real time, but the hub device only records the time at which each speaker starts speaking, i.e. the acquisition time. In the embodiments of the present application, the voice data to be simultaneously transmitted may be any voice that needs to be translated, for example voice collected in real time in the application scenario, and it may be in any language; the embodiments of the present application do not limit the specific voice data.
In the embodiments of the present application, multiple speakers may speak after the interview starts; a speaker here is whichever participant is speaking at a given time, and the embodiments of the present application are not limited in this respect.
S102, determining the speaker's identity information based on the voice data to be simultaneously transmitted and a preset mapping relationship, and interpreting the voice data into each listener's target language in real time to obtain translation information; the preset mapping relationship is the correspondence among participant identity information, target languages and participant voiceprint information; a listener is any participant other than the speaker.
After the hub device obtains the voice data to be simultaneously transmitted, it can use the stored preset mapping relationship, i.e. the correspondence among participant identity information, target languages and participant voiceprint information. It first finds, among the voiceprint information of the participants stored in the preset mapping relationship, the target voiceprint information that matches the voice data; it then looks up the participant identity information corresponding to that target voiceprint information, which is the speaker's identity information, and at the same time identifies the listeners based on the correspondence between participant identity information and target language. It then determines each listener's target language from that correspondence and translates the voice data in real time into translation information in each listener's target language, so that each listener can hear the speaker's speech in a familiar language through the receiving terminal. A listener is any participant other than the speaker.
In this embodiment of the present application, the identity information of the speaker may be a unique identifier of a name or an identity of the speaker, and this embodiment of the present application is not limited.
It should be noted that the preset mapping relationship is a corresponding relationship between the participant identity information, the target language, and the voiceprint information of the participant, and may be represented as a corresponding relationship between the participant identity information and the target language, a corresponding relationship between the participant identity information and the voiceprint information of the participant, a relationship between the target language and the voiceprint information of the participant, a voiceprint information library of the participant, a participant identity information library, a target language library, and the like.
That is to say, in the embodiment of the present application, the hub device determines, from the voiceprint information of the participants in the preset mapping relationship, target voiceprint information that matches the voice data to be simultaneously transmitted; determines the speaker identity information corresponding to the voiceprint of the voice data to be simultaneously transmitted based on the target voiceprint information and the corresponding relationship between the participant identity information and the voiceprint information of the participants in the preset mapping relationship; acquires the listener target language corresponding to each listener based on the corresponding relationship between the participant identity information and the target language in the preset mapping relationship; and finally translates the voice data to be simultaneously transmitted into the target language of the listener in real time to obtain translation information.
In some embodiments of the present application, since the central device stores the corresponding relationship between the target language and the voiceprint information of the participants, once the target voiceprint information corresponding to the speaker is determined, the participants whose voiceprint information did not match are the listeners; the target languages corresponding to the unmatched voiceprint information are therefore the listener target languages, so that the listener target language corresponding to each listener can be determined.
Illustratively, assuming that the participants are A, B and C, and a is speaking, when it is determined that a is the speaker, then B and C are listeners, so that the target languages corresponding to the listeners B and C can be found from the a target language, the B target language and the C target language.
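Continuing the A/B/C example, the voiceprint match and listener determination can be sketched in Python. The cosine-similarity matcher and the short embedding vectors below are illustrative stand-ins for a real voiceprint-recognition model, and all names are assumptions rather than the patent's concrete data format:

```python
import math

# Hypothetical participant registry: the "preset mapping relationship"
# linking identity information, target language, and a voiceprint embedding.
PARTICIPANTS = [
    {"name": "A", "target_language": "zh", "voiceprint": [0.9, 0.1, 0.0]},
    {"name": "B", "target_language": "en", "voiceprint": [0.1, 0.9, 0.0]},
    {"name": "C", "target_language": "fr", "voiceprint": [0.0, 0.1, 0.9]},
]

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def identify_speaker(utterance_print, participants=PARTICIPANTS):
    """Return (speaker, listeners): the best voiceprint match and the rest."""
    speaker = max(participants,
                  key=lambda p: cosine(utterance_print, p["voiceprint"]))
    listeners = [p for p in participants if p is not speaker]
    return speaker, listeners

# An utterance whose voiceprint embedding is closest to A's:
speaker, listeners = identify_speaker([0.88, 0.15, 0.02])
```

Once the speaker is matched, the remaining entries directly yield each listener's target language, as described above.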
In some embodiments of the present application, functions such as speech recognition (ASR, Automatic Speech Recognition), speech synthesis (TTS, Text-To-Speech), voiceprint recognition, translation, and recording (supporting online or offline modes) are built into the hub device, and the hub device has networking and communication functions and can perform data interaction with the control terminal and the integrated terminals.
It can be understood that the central hub device combines voiceprint recognition, an ASR technology, a machine translation technology and a TTS technology to construct a set of simultaneous interpretation system in an interview scene, and communication obstacles between different languages are solved.
In the embodiment of the application, the central device adopts a voiceprint recognition technology to recognize target voiceprint information matched with the to-be-simulcast voice data from the voiceprint information of the participants in the preset mapping relation, and adopts a machine translation technology to simultaneously translate the to-be-simulcast voice data into the target language of the listener in real time, so that translation information which can be understood by each listener is obtained.
In some embodiments of the present application, after translating the translation information required by each listener in real time, the hub device may send the translation information to the receiving terminals of the listeners participating in the voice interview in real time, so that the listeners can hear the speaking information in the familiar language, i.e. the translated voice data, through the respective corresponding receiving terminals.
It should be noted that, when the hub device communicates with each integrated terminal, the corresponding relationship between each integrated terminal and the participant is stored, that is, the hub device can accurately send the data to be sent to the participant to the integrated terminal corresponding to the participant. Thus, the central device can correspondingly transmit the respective translation information obtained according to the target language of each listener to the receiving terminal of the listener.
In some embodiments of the present application, the translation information includes translation text information and translation speech data. After the central equipment translates the voice data to be simultaneously transmitted into translation text information, the translation text information is converted into translation voice data by adopting a TTS technology. Thus, the hub device can transmit the translated voice data and the translated text information to the receiving terminals of the listeners who participate in the voice interview in real time.
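The per-utterance pipeline just described (ASR, then machine translation per listener, then TTS) can be sketched as below. The three engine functions are deliberately trivial stand-ins, not real ASR/MT/TTS APIs; only the flow of data matches the description above:

```python
def asr(audio):
    # Stand-in speech recognition: for illustration the "audio" argument
    # is already a transcript string.
    return audio

def translate(text, target_language):
    # Stand-in machine translation: tag the text with its target language.
    return f"[{target_language}] {text}"

def tts(text):
    # Stand-in speech synthesis: wrap the text as pseudo-audio bytes.
    return text.encode("utf-8")

def interpret_for_listeners(audio, listener_languages):
    """Produce translated text and synthesized speech for every listener."""
    source_text = asr(audio)
    out = {}
    for lang in listener_languages:
        translated = translate(source_text, lang)
        out[lang] = {"text": translated, "speech": tts(translated)}
    return out

result = interpret_for_listeners("hello everyone", ["zh", "fr"])
```

Each listener's receiving terminal would then be sent the entry matching that listener's target language.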
In this embodiment of the application, if a display device is disposed on the all-in-one terminal, the translated text information may also be displayed on the display for a listener to watch, and the specific implementation of this embodiment of the application is not limited.
S103, recording acquisition time, speaker identity information and translation information corresponding to the voice data to be simultaneously transmitted to obtain recording segment information, and further obtaining at least one recording segment information when the voice interview is finished.
After the central hub equipment obtains the translation information corresponding to each listener, the central hub equipment can obtain one recording fragment information of a speaker by recording the acquisition time, the speaker identity information and the translation information corresponding to the voice data to be simultaneously transmitted, and then continue to obtain the recording fragment information of the next speaker, so that when the voice interview is finished, the central hub equipment can obtain at least one recording fragment information corresponding to different speakers.
It should be noted that, one recorded segment information may include the voice data to be transmitted simultaneously, the identity information of the speaker, the translation information, and the acquisition time. Each recording segment may use fields for data recording.
In some embodiments of the present application, the hub device may perform text recognition on the to-be-co-transmitted voice data to obtain source text information; recording acquisition time, speaker identity information, translation information and source text information corresponding to the voice data to be simultaneously transmitted until target voiceprint information is changed to obtain recording fragment information, and further obtaining at least one recording fragment information when the voice interview is finished.
Further, in this embodiment, one recording clip information may further include: source text information.
It should be noted that the central device uses ASR technology to convert the speech data to be simultaneously transmitted into the source text information.
In the embodiment of the application, in a multi-person interview scene, multiple participants generally speak in turn, and it is also possible that a certain speaker is interrupted while speaking; an interruption can be regarded as a switch of the speaking object, and the division of recording segments takes the switching or ending of the speaker as a boundary. That is to say, the hub device identifies the identity of the speaker in real time based on voiceprint recognition of the voice data to be simultaneously transmitted; when the recognized speaker identity changes at a certain moment, this indicates that the previous speaker has finished speaking and the next speaker has started. The process therefore returns to S101, and the information corresponding to the previous speaker is recorded as one piece of recording segment information, so that when the interview is finished, the hub device has acquired at least one piece of recording segment information.
It should be noted that at least one recording segment information may include recording segment information of the same speaker at different times, which is subject to actual recording conditions, and the embodiment of the present application is not limited.
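The boundary rule above — close a segment whenever the recognized speaker changes, allowing the same speaker to appear again later — can be sketched as follows. The stream of `(speaker, text)` pairs is an illustrative stand-in for the real-time voiceprint-identified utterances:

```python
def split_into_segments(utterances):
    """Group a stream of (speaker, text) pairs into recording segments,
    closing a segment whenever the identified speaker changes."""
    segments = []
    current = None
    for speaker, text in utterances:
        if current is None or current["speaker"] != speaker:
            # Speaker switched (or first utterance): open a new segment.
            current = {"speaker": speaker, "texts": []}
            segments.append(current)
        current["texts"].append(text)
    return segments

stream = [("A", "Hello."), ("A", "Today we discuss 5G."),
          ("B", "I agree."), ("A", "Let me continue.")]
segments = split_into_segments(stream)
# Four utterances collapse into three segments: A, then B, then A again.
```

Note that speaker A yields two separate segments, consistent with the remark that one speaker may produce recording segment information at different times.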
Illustratively, the voiceprint and name of each attendee are entered before the interview begins. After the interview starts, every utterance of each person is saved, and one piece of recording segment information can be recorded in units of complete segments, as shown in Table 1:
TABLE 1

| Field            | Field meaning                                                                    |
| Speaker name     | Name of the speaker; obtained by voiceprint recognition                          |
| Timestamp        | Utterance start timestamp                                                        |
| Audio            | Source-language audio corresponding to the utterance; captured by the microphone |
| Speaking text    | Text corresponding to the source-language speech audio; obtained by ASR          |
| Translation text | Utterance text translated into the target language; the translation result      |
The speaker name is speaker identity information, the timestamp is acquisition time, the audio is voice data to be simultaneously transmitted, the speech text is source text information, and the translation text is translation information.
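The segment structure of Table 1 can be sketched as a small data class; the field names, types, and the per-language translation dictionary are illustrative assumptions, not the patent's concrete storage format:

```python
from dataclasses import dataclass, field

@dataclass
class RecordSegment:
    """One piece of recording segment information, mirroring Table 1."""
    speaker_name: str      # speaker identity information, via voiceprint recognition
    timestamp: str         # acquisition time (utterance start)
    audio: bytes           # voice data to be simultaneously transmitted
    speech_text: str       # source text information, obtained by ASR
    translated_text: dict = field(default_factory=dict)  # per target language

seg = RecordSegment(
    speaker_name="Speaker A",
    timestamp="2019-08-31 22:10:15",
    audio=b"...",
    speech_text="I believe that China will lead in the 5G competition!",
    translated_text={"zh": "我认为中国会在5G竞争中遥遥领先！"},
)
```

Keeping one translated entry per target language lets the hub deliver each listener's version without re-translating at display time.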
And S104, generating the interview record based on the at least one record fragment information.
The hub device records at least one recording segment information at the end of the interview, so that the hub device can generate an interview record for the interview based on the at least one recording segment information.
In the embodiment of the application, the hub device can be communicated with the control terminal, and the control terminal is used for receiving some conventional settings of input, such as target language, number of people, listening language of each integral terminal and the like. Besides, the control terminal can also control the function realization of the central control equipment. Such as interview record generation functionality, etc.
In some embodiments of the present application, the hub device may receive an interview generation instruction sent by the control terminal; generating an interview record for at least one recording clip information in the order of the timeline in response to the interview generation instruction; the hub device sends the interview record to the control terminal.
It should be noted that, an input device may be disposed at the control terminal side, and after the voice interview is finished, a user may generate an interview generation instruction through the input device, and then send the interview generation instruction to the hub device, so that at least one recording segment information about the voice interview has been recorded in the hub device, and then the hub device generates an interview record from the at least one recording segment information, and then sends the interview record to the control terminal, so that the control terminal presents the interview record.
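Assembling the recorded segments into a timeline-ordered interview record, as described above, might look like the following sketch (the dictionary fields and the plain-text rendering are hypothetical):

```python
def generate_interview_record(segments):
    """Assemble an interview record by arranging segments along the timeline."""
    ordered = sorted(segments, key=lambda s: s["timestamp"])
    lines = [f'{s["timestamp"]} {s["speaker"]}: {s["text"]}' for s in ordered]
    return "\n".join(lines)

# Segments arrive in recording order, which may not be sorted if edits occur;
# sorting by acquisition time restores the timeline.
segments = [
    {"timestamp": "22:12:10", "speaker": "Speaker B", "text": "I agree."},
    {"timestamp": "22:10:15", "speaker": "Speaker A", "text": "5G will lead."},
]
record = generate_interview_record(segments)
```

The hub would then send the assembled record to the control terminal in response to the interview generation instruction.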
It can be understood that, in a voice interview scene, the central hub device can determine the speaker identity information for the voice data to be simultaneously transmitted and acquire translation information in a language meeting each listener's requirements. While performing real-time simultaneous interpretation on the voice data, the central hub device records structured data such as the determined speaker identity information and translation information as recording segment information, and when the interview is finished it can generate the interview record of the voice interview from the plurality of pieces of recording segment information thus obtained. This improves the efficiency of data arrangement in a voice interview, that is, the generation speed and processing efficiency of the interview record are improved.
In some embodiments of the present application, as shown in fig. 3, after S103, an embodiment of the present application further provides a voice information processing method, including: S105-S106. The following:
S105, summary extraction is performed on at least one piece of recording segment information by using a summary extraction technique, and full-text summary information is extracted;
and S106, generating the interview record based on the at least one record fragment information and the full-text abstract information.
The central hub device records at least one piece of recording segment information when the interview is finished, and in addition, the central hub device can also adopt an abstract extraction technology (such as a TextRank algorithm) to extract the abstract of the at least one piece of recording segment information to extract full-text abstract information, wherein the full-text abstract information represents a summary of main contents of the interview of the speakers in the voice interview, namely the full-text abstract information is a full-text summary extracted after all the speeches of all the speakers are summarized. The hub device may then generate an interview record for the interview based on the at least one record fragment information and the full-text summary information.
In the embodiment of the present application, full-text summary information may be placed at the beginning of the interview record, so that the user can know about the interview content and decide whether to continue reading the interview record.
It should be noted that, in the era of information explosion, users need to acquire information quickly. When a certain speech segment of a speaker is long, a summary extraction technique can be applied to extract a core summary, thereby improving the efficiency with which the user reads the interview record.
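As a rough illustration of extractive summarization, the sketch below scores sentences by word frequency and keeps the highest-scoring ones in their original order. This is a crude stand-in for the TextRank algorithm the embodiment mentions, not an implementation of it:

```python
import re
from collections import Counter

def extract_summary(text, max_sentences=1):
    """Crude extractive summary: score sentences by total word frequency
    and keep the top ones in original order (a stand-in for TextRank)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))
    # Rank sentence indices by descending cumulative word frequency.
    scored = sorted(
        range(len(sentences)),
        key=lambda i: -sum(freq[w] for w in re.findall(r"\w+", sentences[i].lower())),
    )
    keep = sorted(scored[:max_sentences])  # restore original order
    return " ".join(sentences[i] for i in keep)

summary = extract_summary(
    "5G will grow fast. 5G will grow and grow in China. Thank you."
)
```

A production system would use a graph-based ranker such as TextRank instead of raw frequency, but the input/output shape is the same: segment text in, a short summary out.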
In some embodiments of the present application, as shown in fig. 4, after the recording, in S103, the acquisition time, the speaker identity information, and the translation information corresponding to the to-be-concurrently-transmitted voice data to obtain one piece of recording segment information, the method further includes: S107-S109. The following were used:
and S107, abstract extraction is carried out on the information of one recording segment by adopting an abstract extraction technology, and the abstract information of the speaker is extracted.
S108, when the voice interview is finished, acquiring at least one recording fragment message and at least one speaker summary message;
and S109, generating an interview record based on the at least one record fragment information and the at least one speaker summary information.
In the embodiment of the application, after the central hub equipment records the acquisition time, the speaker identity information and the translation information corresponding to the voice data to be simultaneously transmitted to obtain one piece of recording segment information, it can apply a summary extraction technique to that piece of recording segment information to extract the speaker summary information. That is, after each piece of recording segment information is generated, the speaker summary information for that piece can be extracted at the same time, so that at least one piece of recording segment information and at least one piece of speaker summary information are obtained when the voice interview is finished. Finally, the hub device may generate an interview record based on the at least one piece of recording segment information and the at least one piece of speaker summary information.
In some embodiments of the present application, the hub device may also perform digest extraction on each recording segment information after acquiring the at least one recording segment information, so as to obtain at least one speaker digest information, which is not limited in the embodiments of the present application.
It is understood that the hub device can summarize the central idea of each recorded segment information into speaker summary information, so as to improve the reading efficiency of each speech information in the generated interview record.
In some embodiments of the application, the generation of the interview record can be further based on the full-text summary information, the at least one speaker summary information and the at least one record fragment information, so that the generated interview record not only contains the content summary of the full text, but also can include the content summary of each piece of utterance information, thereby greatly improving the embodiment of the main idea in the interview record and further improving the richness and diversity of the interview record.
In some embodiments of the present application, a speaker may be further disposed in the center device, or a speaker may be connected to the center device, the control terminal may control a broadcast language of the speaker, and the speaker may convert the speech of the speaker into the broadcast language for playing in the interview occasion.
Fig. 5 is a first flowchart illustrating a voice information processing method according to a further embodiment of the present application. As shown in fig. 5, the voice information processing method applied to the control terminal includes the following steps:
s201, receiving identity information of a participant, a target language and voiceprint information of the participant.
S202, sending a preset mapping relation formed by the identity information of the participant, the target language and the voiceprint information of the participant to the central equipment.
In the scenario of voice interview, the control terminal may be a smart device in which a specified application (e.g., a simultaneous interpretation application for implementing the voice processing method provided by the present application, etc.) is installed, such as a smart phone, a tablet computer, a computer, etc., and the control terminal may communicate with the central apparatus without limitation in the embodiment of the present application. The control terminal side may be provided with input means that may input some conventional settings such as target language, number of people, and listening language of each earphone microphone terminal.
In the embodiment of the application, before a voice interview is performed, a user can enter the relevant information of the participants through the control terminal, for example, the identity information and the target language of each participant. Meanwhile, the voiceprint information of the participants is collected through the integrated terminal recordings and sent by the central equipment to the control terminal; the control terminal then associates the participant identity information, the participant voiceprint information and the target language to form a preset mapping relation, and finally sends the preset mapping relation to the central equipment for its use.
In some embodiments of the application, the control terminal may further send the participant identity information and the target language to the hub device, and the integrated terminal sends the participant voiceprint information to the hub device, and the hub device corresponds the participant identity information, the participant voiceprint information, and the target language to form a preset mapping relationship. Therefore, the process of obtaining the preset mapping relationship may be selected according to actual conditions, and the embodiment of the present application is not limited.
The participant identity information may be a name of the participant, or a unique identity, and the like, which is not limited in the embodiment of the present application.
It should be noted that, in the embodiment of the present application, the preset mapping relationship is a corresponding relationship between the participant identity information, the target language, and the voiceprint information of the participant, and may be expressed as a preset mapping relationship that includes a corresponding relationship between the participant identity information and the target language, a corresponding relationship between the participant identity information and the voiceprint information of the participant, a relationship between the target language and the voiceprint information of the participant, a voiceprint information library of the participant, a participant identity information library and a target language library of the participant, and the like.
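One possible in-memory form of the preset mapping relation — purely illustrative, since the embodiment leaves the representation open — is a pair of dictionaries keyed by participant identity:

```python
def build_preset_mapping(entries):
    """Form the preset mapping relation from per-participant entries of
    (identity information, target language, voiceprint information)."""
    mapping = {
        "identity_to_language": {},    # participant identity -> target language
        "identity_to_voiceprint": {},  # participant identity -> voiceprint
    }
    for identity, language, voiceprint in entries:
        mapping["identity_to_language"][identity] = language
        mapping["identity_to_voiceprint"][identity] = voiceprint
    return mapping

mapping = build_preset_mapping([
    ("A", "zh", [0.9, 0.1]),
    ("B", "en", [0.1, 0.9]),
])
```

Equivalent representations (separate libraries for voiceprints, identities and languages, as the text notes) carry the same information; this dictionary form simply makes both lookups used by the hub device direct.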
And S203, receiving an interview trigger instruction when the interview is finished, and generating an interview generation instruction in response to the interview trigger instruction.
And S204, sending the interview generation instruction to the central hub equipment.
And S205, receiving an interview record fed back by the central hub equipment according to the interview generation instruction, wherein the interview record is generated by the central hub equipment responding to the interview generation instruction and based on a preset mapping relation and real-time received voice data to be simultaneously transmitted.
In the embodiment of the application, the control terminal can control the central equipment to generate the interview record. When the interview is finished, a user can trigger the interview record generation function through the control terminal, generating an interview trigger instruction; the control terminal then generates an interview generation instruction and sends it to the central equipment. While performing real-time simultaneous interpretation on the voice data to be simultaneously transmitted, the central equipment has recorded the determined structured data, such as the speaker identity information and translation information, as recording segment information, that is, at least one piece of recording segment information; the central equipment then generates the interview record from the at least one piece of recording segment information and sends the interview record to the control terminal for the control terminal to present.
The interview record is generated by the central control equipment responding to an interview generation instruction based on a preset mapping relation and real-time received voice data to be simultaneously transmitted.
It can be understood that the control terminal can obtain the interview record from the central hub device, and the interview record is information recording the voice interview content, so that a user can obtain or view the interview record through the control terminal, which is convenient and fast and improves the intelligence of the control terminal.
In some embodiments of the present application, as shown in fig. 6, after S205, the method further includes: s206, or S207-S208, or S209-S211. The following were used:
and S206, displaying the interview records according to the time shaft sequence.
After the control terminal acquires the interview record, since the interview-related information of the speakers recorded in the interview record and the order of the speakers' utterances are time-dependent, the content contained in the interview record is also time-ordered, and thus the control terminal can display the interview record in timeline order.
In this embodiment of the application, in the case that each utterance in the interview record corresponds to one segment of the interview record, each segment of the interview record in the interview record may include: speaker identity information, acquisition time, voice data to be simultaneously transmitted and translation information. In this way, the control terminal can arrange each section of interview record in the interview records according to the time of acquisition; and displaying the identity information of the speaker, the voice data to be simultaneously transmitted and the translation information corresponding to each arranged interview record.
It should be noted that the voice data to be simultaneously transmitted and the translation information corresponding to each arranged interview record segment can be arranged in sequence, via function keys, in the area corresponding to the speaker identity information; when a function key is triggered, the content corresponding to that key is displayed accordingly. The text information recognized from the voice data to be simultaneously transmitted, namely the source text information, can also be transmitted by the central control equipment to the control terminal, and the translation information can also comprise the translated text information and the translated voice data. To facilitate convenient display, one content item can be assigned to one function key, and some content can be realized jointly by two keys. For example, the source text information corresponds to function key 1, the translated text information corresponds to function key 2, and the voice data corresponds to function key 3: when keys 1 and 3 are triggered simultaneously, the voice data to be simultaneously transmitted is played; when keys 2 and 3 are triggered simultaneously, the translated voice data is played; when key 1 is triggered, the source text information is displayed; and when key 2 is triggered, the translated text information is displayed.
In some embodiments of the present application, a comparison key may be further provided for comparing and displaying the source text information and the translated text information, and the like, and the embodiments of the present application do not limit the key arrangement mode and arrangement of the content displayed based on the interview record.
It should be noted that, for different listeners, because the target languages of the listeners are different, the translation information in the generated interview records will be different, so that the interview records corresponding to each listener can be generated for each listener, and all utterances can be translated into a general language.
Illustratively, the speaker language is English and the translation language is Chinese. As shown in fig. 7 and 8, in a display interface of interview records, the control terminal obtains an xxxx interview record; four function keys, "original", "translated", "audio" and "contrast", are arranged in the display interface, and the speakers include Speaker A, Speaker B and Speaker C. The four function keys "original", "translated", "audio" and "contrast" are provided in the corresponding area 1 behind the identity information of each speaker; the acquisition time (for example, Speaker A: 2019.08.31 22:10:15; Speaker B: 2019.08.31 22:12:10; Speaker C: 2019.08.31 22:13:08), namely the utterance start time, is arranged in area 2 below the identity information of each speaker; and the utterance records are displayed arranged by utterance time. When "translated" corresponding to Speaker A is triggered, the translated text information is presented in area 3: "I think that China will be far ahead in the 5G contest! Within one to two years in the future, 5G will start to land and gain explosive growth". When "original" corresponding to Speaker B is triggered, the source text information is presented in area 4: "I agree with you very much, I think 5G will bring a lot of new opportunities". When "audio" corresponding to Speaker C is triggered, the translated voice data is played in area 5. When "contrast" corresponding to Speaker A is triggered, the source text information is presented in area 6: "I believe that China will lead in the 5G competition! In the next one to two years, 5G will begin to be applied and achieve explosive growth", together with the translated text information: "I think that China will be far ahead in the 5G contest! Within one to two years in the future, 5G will start to land and gain explosive growth".
As can be seen from the figure, the interview recording segment of each speaker's utterance is time-stamped and can be selected as original, translated, audio, or cross-referenced.
In some embodiments of the present application, each of the interview records further comprises: speaker summary information; after the control terminal displays the identity information of the speaker, the voice data to be simultaneously transmitted and the translation information corresponding to each arranged interview record, the control terminal can also display the summary information of the speaker in a first preset area in the display area of each interview record.
It should be noted that, in this embodiment of the application, if a speech segment of a speaker is relatively long (for example, more than 70 words), the central hub device generates a speech segment summary (i.e., speaker summary information), and carries the speaker summary information in the interview record to send to the control terminal, so that a reader can select and read the speaker summary information on the control terminal, thereby selecting the speech information of a specific speaker.
It can be understood that the central hub device can summarize the central idea of each recorded segment information into speaker summary information, so that the speaker summary information displayed by the control terminal can provide readers to quickly know the main content of each session of interview recording, thereby improving the reading efficiency of each utterance information in the generated interview recording.
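The conditional generation described above — producing speaker summary information only for long utterances, with 70 words as the example threshold — can be sketched as below; the threshold, helper names, and the first-sentence summarizer are illustrative assumptions:

```python
def maybe_speaker_summary(text, summarize, threshold=70):
    """Generate speaker summary information only when the utterance exceeds
    the word threshold; shorter utterances are recorded without a summary."""
    if len(text.split()) <= threshold:
        return None
    return summarize(text)

# A stand-in summarizer that just keeps the first sentence.
first_sentence = lambda t: t.split(". ")[0] + "."

short = maybe_speaker_summary("A short remark.", first_sentence)
long_text = "5G leads. " + "word " * 80
long_summary = maybe_speaker_summary(long_text, first_sentence, threshold=70)
```

Passing the summarizer in as a callable keeps the length check independent of whichever extraction technique (e.g. TextRank) the hub device actually uses.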
Illustratively, as shown in fig. 9, for the utterance of Speaker A, the translated text information is: "I think that China will be far ahead in the 5G contest! Within one to two years in the future, 5G will start to land and gain explosive growth. In the previous 2G, 3G and 4G eras, China was always in a passive position, which led foreigners to believe that China could not be the first to develop 5G. China has instead become a leader in the 5G era." The extracted speaker summary is "A considers that China's 5G is in the leading position and will gain large-scale growth", so the control terminal can display the speaker summary in area 11 of the interview record display interface.
It should be noted that the speaker summary of each speaker may be displayed together with each session record, or may be displayed in the area E by the operation of the user on the display interface of the session record.
In some embodiments of the present application, the interview record further comprises: full text summary information; when the control terminal displays the speaker identity information, the voice data to be simultaneously transmitted and the translation information corresponding to each arranged interview record, the control terminal can also display the full text summary information in front of the speaker identity information, the voice data to be simultaneously transmitted and the translation information corresponding to each arranged interview record.
The hub device may perform summary extraction on the at least one piece of recording segment information by using a summary extraction technique (e.g., the TextRank algorithm) to extract full-text summary information, where the full-text summary information represents a summary of the main content of all speakers in the voice interview; that is, the full-text summary information is a full-text summary extracted after summarizing all utterances of each speaker. Thus, the central hub device can generate an interview record for the interview based on the at least one piece of recording segment information and the full-text summary information, and send the interview record to the control terminal.
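The summary-extraction step above names TextRank as one example technique. A heavily simplified TextRank-style extractive summarizer can be sketched as follows; this toy version (word-overlap similarity, plain power iteration) is an assumption for illustration, not the patent's implementation:

```python
import math
import re

def textrank_summary(text, k=2, damping=0.85, iters=50):
    """Simplified TextRank-style extractive summarizer (a sketch only)."""
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    n = len(sentences)
    if n <= k:
        return sentences
    bags = [set(s.lower().split()) for s in sentences]

    def sim(i, j):
        # Word overlap normalized by sentence lengths, as in TextRank.
        overlap = len(bags[i] & bags[j])
        denom = math.log(len(bags[i]) + 1) + math.log(len(bags[j]) + 1)
        return overlap / denom if denom else 0.0

    w = [[sim(i, j) if i != j else 0.0 for j in range(n)] for i in range(n)]
    score = [1.0] * n
    for _ in range(iters):  # power iteration over the sentence graph
        score = [
            (1 - damping) + damping * sum(
                w[j][i] / sum(w[j]) * score[j]
                for j in range(n) if w[j][i] and sum(w[j])
            )
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: score[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # keep original sentence order
```

The hub device would run something of this shape over the concatenated recording segment text to obtain the full-text summary information.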
In the embodiment of the present application, full-text summary information may be placed at the beginning of the interview record, so that the user can know about the interview content and decide whether to continue reading the interview record.
It should be noted that, in the era of information explosion, users need to acquire information quickly. When a certain utterance of a speaker is long, a summary extraction technique can be applied to extract its core summary, thereby improving the efficiency with which the user reads the interview record.
Illustratively, as shown in fig. 10, for the xxxx interview record, area F at the very front of the full-text record, i.e., before the multiple recording segments of the interview record, shows the full-text summary information of the voice interview: "Full-text summary: A, B, and C shared important outlooks on the new 5G era. A believes that Chinese 5G is far ahead and will be commercialized on a large scale, bringing unlimited opportunities, and B and C agree on this."
S207, receiving an editing instruction;
S208, responding to the editing instruction, and editing the interview record to obtain and display the final interview record.
After the control terminal receives the interview record, errors may still exist in the machine-processed interview record, or the record may need manual additions and polishing. For this purpose, the control terminal also provides an editing function: when the user triggers the editing function, the control terminal receives an editing instruction; in the display interface of the interview record, it responds to the editing instruction, obtains the editing information, edits the interview record according to the obtained editing information, and thereby obtains and displays the final interview record.
It can be understood that the editing function perfects the content of the interview record and eliminates errors, so that the final interview record obtained is more accurate and complete.
S209, receiving an export instruction;
S210, responding to the export instruction, and performing preset format processing on the interview record to obtain an export file;
S211, sharing the export file.
After the control terminal receives the interview record, there may be a need to share it. For this purpose, the control terminal also provides a sharing function: when the user triggers the sharing function, the control terminal receives an export instruction; in the display interface of the interview record, it responds to the export instruction, obtains the export format, performs preset format processing on the interview record according to the obtained export format to obtain an export file, and finally shares the export file.
In an embodiment of the present application, the export format may include the HTML format, the txt format, the PDF format, and the like; any text format or web page format may be used, which is not limited in the embodiment of the present application.
It should be noted that, when the control terminal exports the interview record, different formats can be exported for different purposes and platforms, for example, exporting plain text in txt format, an archive in PDF format, or a shareable page in HTML format.
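The per-format export step above can be sketched as follows. The `(speaker, time, text)` entry layout is an assumption for illustration; the patent does not fix the record's internal field names:

```python
import html

def export_record(title, entries, fmt="txt"):
    """Render an interview record for export in a chosen format (sketch).

    `entries` is assumed to be a list of (speaker, time, text) tuples;
    PDF export is omitted here since it needs a third-party library.
    """
    if fmt == "txt":
        lines = [title] + ["[%s] %s: %s" % (t, spk, txt) for spk, t, txt in entries]
        return "\n".join(lines)
    if fmt == "html":
        rows = "".join(
            "<p><b>%s</b> (%s): %s</p>" % (html.escape(spk), t, html.escape(txt))
            for spk, t, txt in entries
        )
        return "<html><body><h1>%s</h1>%s</body></html>" % (html.escape(title), rows)
    raise ValueError("unsupported export format: %s" % fmt)

entries = [("A", "10:00", "Chinese 5G is far ahead."),
           ("B", "10:02", "I agree with A.")]
print(export_record("xxxx interview record", entries, "txt"))
```

The txt branch suits plain-text archiving, while the HTML branch (with escaping) suits web-page sharing, matching the purposes listed above.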
In some embodiments of the application, the voice data related to the interview can be stored in the central hub device or in the cloud, so that the control terminal can also attach a voice-sharing link when exporting the interview record; in this way, the interview record and the voice link can be shared with others, so that they can obtain both the text information and the voice information of the interview record.
It can be understood that the sharing function provides a function of sharing interview records, and intelligence and diversity in a voice interview scene are improved.
It can be understood that the voice information processing method provided by the application addresses the generation of interview records, greatly improves the efficiency of post-interview organization of reports, and provides convenient functions such as original text, translated text, comparison, audio, and summaries.
An embodiment of the present application provides a method for processing voice information, as shown in fig. 11, including:
s301, the control terminal receives the identity information of the participant, the target language and the voiceprint information of the participant.
S302, the control terminal sends a preset mapping relation formed by the identity information of the participant, the target language and the voiceprint information of the participant to the central equipment.
S303, after the voice interview starts, the central hub equipment receives, from the acquisition terminal, the voice data to be simultaneously transmitted of the speaker, and acquires the acquisition time at which acquisition of the voice data to be simultaneously transmitted starts.
S304, determining identity information of a speaker based on the voice data to be simultaneously transmitted and a preset mapping relation, and simultaneously transmitting and translating the voice data to be simultaneously transmitted into a target language of a listener in real time to obtain translation information; the preset mapping relationship is the corresponding relationship between the identity information of the participant, the target language and the voiceprint information of the participant; wherein the listener is a person of the participants other than the speaker.
S305, the central control equipment sends the translation information to the receiving terminal in real time for the receiving terminal to play the translation information.
S306, the central control equipment records the acquisition time, the identity information of the speaker and the translation information corresponding to the voice data to be simultaneously transmitted to obtain a recording segment information, and then obtains at least one recording segment information when the voice interview is finished.
S307, the central control equipment receives an interview generation instruction sent by the control terminal; generating an interview record for the at least one recording clip information in order of the timeline in response to the interview generation instructions.
S308, the central hub equipment sends the interview record to the control terminal.
S309, the control terminal displays the interview record.
Exemplarily, the names (participant identity information) and languages (target languages) of the participants are input through the control terminal, the voiceprint information of the participants is collected through recording on the earphone/microphone integrated terminals, and the central hub device maps the participant identity information, the target languages, and the voiceprint information of the participants to one another to obtain the preset mapping relationship. When the interview begins, the simultaneous interpretation module in the central hub device starts to work: the voice data to be simultaneously transmitted of the speaker is recorded by the earphone/microphone integrated terminal and sent to the central hub device in real time; the central hub device completes transcription, translation, recording, and translated-speech generation according to the preset mapping relationship to obtain recording segment information, and then sends the translated speech to the earphone/microphone integrated terminals of the other participants (the listeners), so that each listener hears the corresponding translated speech. After the interview ends, the central hub device, under the control of the control terminal, generates the interview record from the plurality of recording segment information; finally, the interview record is edited through the control terminal to obtain and display the final interview record.
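The preset-mapping and voiceprint-matching steps (S302, S304) can be sketched as follows. The dict layout, cosine similarity, and the toy 3-dimensional "voiceprints" are all assumptions for illustration; real voiceprint embeddings and the matching threshold are outside the scope of this sketch:

```python
import math

# Hypothetical preset mapping: participant identity -> (target language, voiceprint).
PRESET_MAPPING = {
    "A": {"language": "zh", "voiceprint": [0.9, 0.1, 0.0]},
    "B": {"language": "en", "voiceprint": [0.1, 0.9, 0.2]},
    "C": {"language": "ja", "voiceprint": [0.0, 0.2, 0.9]},
}

def cosine(u, v):
    """Cosine similarity between two voiceprint vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def identify_speaker(voiceprint):
    """Find the participant whose stored voiceprint best matches the audio (S304)."""
    return max(PRESET_MAPPING,
               key=lambda pid: cosine(voiceprint, PRESET_MAPPING[pid]["voiceprint"]))

def listener_languages(speaker_id):
    """Target languages of everyone except the speaker, i.e. the listeners."""
    return {pid: info["language"]
            for pid, info in PRESET_MAPPING.items() if pid != speaker_id}

spk = identify_speaker([0.88, 0.12, 0.05])  # closest to A's stored voiceprint
print(spk, listener_languages(spk))
```

Once the speaker is identified, the hub device would translate the incoming voice data into each listener's target language returned by `listener_languages`.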
As shown in fig. 12, an embodiment of the present application provides a hub apparatus 1, where the hub apparatus 1 may include:
the first receiving unit 10 is configured to receive voice data to be simultaneously transmitted of a speaker transmitted by an acquisition terminal after a voice interview starts, and acquire acquisition time for starting acquisition of the voice data to be simultaneously transmitted;
the determining unit 11 is configured to determine identity information of a speaker based on the to-be-simultaneously-transmitted voice data and a preset mapping relationship;
the translation unit 12 is configured to translate the to-be-simultaneously-transmitted voice data into a listener target language in real time to obtain translation information; the preset mapping relationship is a corresponding relationship among the identity information of the participant, the target language and the voiceprint information of the participant; wherein a listener is a person of the participants other than the speaker;
the recording unit 13 is configured to record the acquisition time, the speaker identity information, and the translation information corresponding to the to-be-simultaneously-transmitted voice data to obtain a recording segment information, and further obtain at least one recording segment information when a voice interview is finished;
a first generating unit 14 for generating an interview record based on the at least one recording segment information.
In some embodiments of the present application, the determining unit 11 is further configured to determine, from the voiceprint information of the participants in the preset mapping relationship, target voiceprint information that matches the voiceprint of the voice data to be simultaneously transmitted; and to determine the speaker identity information corresponding to the voiceprint of the voice data to be simultaneously transmitted based on the target voiceprint information and the correspondence between the participant identity information and the voiceprint information of the participants in the preset mapping relationship, and to acquire the listener target language corresponding to each listener based on the correspondence between the participant identity information and the target language in the preset mapping relationship;
the translation unit 12 is further configured to translate the to-be-simultaneously-transmitted voice data into the listener target language in real time, so as to obtain the translation information.
In some embodiments of the present application, the recording unit 13 is further configured to perform text recognition on the to-be-simultaneously-transmitted voice data to obtain source text information; and recording the acquisition time, the speaker identity information, the translation information and the source text information corresponding to the voice data to be simultaneously transmitted until the target voiceprint information is changed to obtain a recording segment information, and further obtaining the at least one recording segment information when the voice interview is finished.
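The recording unit's behavior above — accumulating one recording segment until the target voiceprint information changes — amounts to grouping consecutive utterances of the same matched speaker. A sketch, with illustrative field names (`speaker`, `time`, `source`, `translation`) that are assumptions of this example:

```python
from itertools import groupby

def build_segments(utterances):
    """Group consecutive utterances of the same speaker into recording segments.

    A new segment starts whenever the matched speaker (i.e. the target
    voiceprint information) changes, mirroring the recording unit above.
    """
    segments = []
    for speaker, group in groupby(utterances, key=lambda u: u["speaker"]):
        group = list(group)
        segments.append({
            "speaker": speaker,
            "time": group[0]["time"],  # acquisition time of the segment's start
            "source": " ".join(u["source"] for u in group),
            "translation": " ".join(u["translation"] for u in group),
        })
    return segments
```

`itertools.groupby` only merges adjacent equal keys, so the same speaker talking again later correctly opens a new segment rather than being merged into the earlier one.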
In some embodiments of the present application, the hub device 1 further comprises: an extraction unit 15;
the extracting unit 15 is configured to record the acquisition time, the speaker identity information, and the translation information corresponding to the to-be-simultaneously-transmitted voice data to obtain a piece of recorded segment information, and further extract, by using an abstract extracting technology, the at least one piece of recorded segment information after obtaining the at least one piece of recorded segment information when a voice interview is finished, and extract full-text abstract information;
the first generating unit 14 is further configured to generate the interview record based on the at least one recording segment information and the full-text summary information.
In some embodiments of the present application, the hub device 1 further comprises: an extraction unit 15;
the extracting unit 15 is configured to extract the abstract of one recording segment information by using an abstract extracting technique after the acquisition time, the speaker identity information, and the translation information corresponding to the to-be-simultaneously-transmitted voice data are recorded to obtain one recording segment information, so as to extract speaker abstract information; and acquiring the at least one recording segment information and the at least one speaker summary information when the voice interview is finished.
In some embodiments of the present application, the first generating unit 14 is further configured to generate the interview record based on the at least one record segment information and the at least one speaker summary information.
In some embodiments of the present application, the hub device 1 further comprises: a first transmission unit 16;
the first receiving unit 10 is further configured to receive an interview generation instruction sent by a control terminal;
the first generating unit 14 is further configured to generate the interview record for the at least one recording clip information in order of a time axis in response to the interview generation instruction;
the first sending unit 16 is configured to send the interview record to the control terminal.
It can be understood that, in a voice interview scenario, the central hub device can determine speaker identity information for the speaker's voice data to be simultaneously transmitted and obtain translation information in a language meeting the listeners' requirements, and can generate an interview record for the interview based on this information when the interview ends. In other words, while performing real-time simultaneous interpretation of the voice data to be simultaneously transmitted, the central hub device records structured data such as the determined speaker identity information and translation information as recording segment information, and finally generates the interview record of the voice interview from the plurality of recording segment information obtained when the interview ends, thereby improving the efficiency of data organization in the voice interview, i.e., improving the generation speed and processing efficiency of the interview record.
As shown in fig. 13, an embodiment of the present application provides a control terminal 2, where the control terminal 2 may include:
a second receiving unit 20, configured to receive participant identity information, a target language, and voiceprint information of a participant;
the mapping unit 21 is configured to send a preset mapping relationship formed by the participant identity information, the target language, and the voiceprint information of the participant to the hub device;
the second receiving unit 20 is further configured to receive an interview trigger instruction when an interview is finished;
a second generating unit 22, configured to generate an interview generation instruction in response to the interview trigger instruction;
a second sending unit 23, configured to send the interview generation instruction to the hub device;
the second receiving unit 20 is configured to receive an interview record fed back by the center device according to the interview generation instruction, where the interview record is generated by the center device responding to the interview generation instruction and based on the preset mapping relationship and the real-time received voice data to be simultaneously transmitted.
In some embodiments of the present application, the control terminal 2 further includes: a display unit 24;
the display unit 24 is configured to display the interview records according to a time axis sequence after receiving the interview records fed back by the central hub device for the interview generation instruction.
In some embodiments of the present application, each of the interview records comprises: speaker identity information, acquisition time, voice data to be simultaneously transmitted and translation information.
In some embodiments of the present application, the display unit 24 is further configured to arrange each of the interview records in a time axis order according to the acquisition time; and displaying the identity information of the speaker, the voice data to be simultaneously transmitted and the translation information corresponding to each arranged interview record.
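The display unit's time-axis arrangement can be sketched as a simple sort over the acquisition times. The field names and the zero-padded `HH:MM` time strings (which sort correctly as plain strings) are assumptions of this example:

```python
def arrange_for_display(records):
    """Arrange interview records in time-axis order and render display lines.

    Each record is assumed to carry 'time', 'speaker', 'source', and
    'translation' fields; the names are illustrative only.
    """
    ordered = sorted(records, key=lambda r: r["time"])
    return ["%s  %s: %s / %s" % (r["time"], r["speaker"], r["source"], r["translation"])
            for r in ordered]

records = [
    {"time": "10:05", "speaker": "B", "source": "I agree.", "translation": "我同意。"},
    {"time": "10:01", "speaker": "A", "source": "5G领先。", "translation": "5G leads."},
]
for line in arrange_for_display(records):
    print(line)
```

In production, a real timestamp type (e.g. `datetime`) rather than a string key would make the sort robust across days.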
In some embodiments of the present application, each of the interview records further comprises: speaker summary information;
the display unit 24 is further configured to display the speaker identity information, the to-be-co-transmitted voice data, and the translation information corresponding to each of the arranged interview records, and then display the speaker summary information in a first preset area in a display area of each of the interview records.
In some embodiments of the present application, the interview record further comprises: full text summary information;
the display unit 24 is further configured to display the full text summary information in front of the speaker identity information, the voice data to be communicated, and the translation information corresponding to each arranged interview record when the speaker identity information, the voice data to be communicated, and the translation information corresponding to each arranged interview record are displayed.
In some embodiments of the present application, the control terminal 2 further includes: an editing unit 25 and a display unit 24;
the second receiving unit 20 is further configured to receive an editing instruction after receiving the interview record fed back by the hub device for the interview generation instruction;
the editing unit 25 is configured to respond to the editing instruction and edit the interview record to obtain a final interview record;
and the display unit 24 is used for displaying the final interview record.
In some embodiments of the present application, the control terminal 2 further includes: an export unit 26 and a sharing unit 27;
the second receiving unit 20 is further configured to receive an export instruction after receiving the interview record fed back by the hub device for the interview generation instruction;
the export unit 26 is configured to respond to the export instruction, and perform preset format processing on the interview record to obtain an export file;
the sharing unit 27 is configured to share the export file.
It can be understood that the control terminal can obtain the interview record from the central hub device, and the interview record is information recording the voice interview content, so that a user can obtain or view the interview record through the control terminal conveniently and quickly, which improves the intelligence of the control terminal.
As shown in fig. 14, an embodiment of the present application provides a hub device, including:
a first processor 17 and a first memory 18;
the first processor 17 is configured to execute the simultaneous interpretation program stored in the first memory 18 to realize the voice information processing method on the center apparatus side.
As shown in fig. 15, an embodiment of the present application provides a control terminal, including:
a second processor 28 and a second memory 29;
the second processor 28 is configured to execute the simultaneous interpretation program stored in the second memory 29, so as to implement the voice information processing method on the control terminal side.
In an embodiment of the present disclosure, the first processor 17 or the second processor 28 may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a CPU, a controller, a microcontroller, and a microprocessor. It is understood that the electronic devices implementing the functions of the first processor 17 or the second processor 28 may also be other devices, and the embodiments of the present disclosure are not limited thereto. The hub device further comprises a first memory 18 and the control terminal further comprises a second memory 29; the first memory 18 may be connected to the first processor 17, and the second memory 29 may be connected to the second processor 28. The first memory 18 or the second memory 29 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory, such as at least two disk memories.
In practical applications, the first memory 18 or the second memory 29 may be a volatile memory, such as a Random-Access Memory (RAM); or a non-volatile memory, such as a Read-Only Memory (ROM), a flash memory, a Hard Disk Drive (HDD), or a Solid-State Drive (SSD); or a combination of the above types of memories, and provides instructions and data to the first processor 17 or the second processor 28.
In addition, each functional module in this embodiment may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware or a form of a software functional module.
Based on this understanding, the technical solution of the present embodiment, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method of the present embodiment. The aforementioned storage medium includes media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disk.
Accordingly, an embodiment of the present application also provides a computer-readable storage medium on which a simultaneous interpretation program is stored; when executed by one or more first processors, the program implements the voice information processing method on the hub device side.
An embodiment of the present application also provides a computer-readable storage medium on which a simultaneous interpretation program is stored; when executed by one or more second processors, the program implements the voice information processing method on the control terminal side.
The computer-readable storage medium may be a volatile memory, such as a Random-Access Memory (RAM); or a non-volatile memory, such as a Read-Only Memory (ROM), a flash memory, a Hard Disk Drive (HDD), or a Solid-State Drive (SSD); or a device that includes one or any combination of the above-mentioned memories, such as a mobile phone, a computer, a tablet device, or a personal digital assistant.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of implementations of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart block or blocks and/or flowchart block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks in the flowchart and/or block diagram block or blocks.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto; any change or replacement that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application.
Industrial applicability
In the voice information processing method provided by the embodiments of the application, in a voice interview scenario, the central hub device can determine speaker identity information and obtain translation information in a language meeting the listeners' requirements, and can generate an interview record for the interview based on this information when the interview ends. In this way, while performing real-time simultaneous interpretation of the voice data to be simultaneously transmitted, the central hub device records structured data such as the determined speaker identity information and translation information as recording segment information, and finally generates the interview record of the voice interview from the plurality of recording segment information obtained when the interview ends.

Claims (20)

  1. A method of processing speech information, comprising:
    after a voice interview starts, receiving voice data to be simultaneously transmitted of a speaker transmitted by an acquisition terminal, and acquiring acquisition time for starting acquisition of the voice data to be simultaneously transmitted;
    determining identity information of a speaker based on the voice data to be simultaneously transmitted and a preset mapping relation, and simultaneously transmitting and translating the voice data to be simultaneously transmitted into a target language of a listener in real time to obtain translation information; the preset mapping relationship is a corresponding relationship among the identity information of the participant, the target language and the voiceprint information of the participant; wherein a listener is a person of the participants other than the speaker;
    recording the acquisition time, the speaker identity information and the translation information corresponding to the voice data to be simultaneously transmitted to obtain a recording segment information, and further obtaining at least one recording segment information when the voice interview is finished;
    generating an interview record based on the at least one recording segment information.
  2. The method according to claim 1, wherein the determining speaker identity information based on the to-be-simulcast voice data and a preset mapping relationship, and performing real-time simulcast translation on the to-be-simulcast voice data into a listener target language to obtain translation information comprises:
    determining target voiceprint information matched with the voiceprint of the voice data to be simultaneously transmitted from the voiceprint information of the participants in the preset mapping relation;
    determining the speaker identity information corresponding to the voiceprint of the voice data to be simultaneously transmitted based on the target voiceprint information and the corresponding relationship between the participant identity information and the voiceprint information of the participant in the preset mapping relationship, and acquiring the listener target language corresponding to the listener based on the corresponding relationship between the participant identity information and the target language in the preset mapping relationship;
    and translating the voice data to be simultaneously transmitted into the target language of the listener in real time to obtain the translation information.
  3. The method of claim 1, wherein the recording the acquisition time, the speaker identity information, and the translation information corresponding to the voice data to be simultaneously transmitted to obtain a recording segment information, and further obtaining at least one recording segment information when a voice interview is finished comprises:
    performing text recognition on the voice data to be simultaneously transmitted to obtain source text information;
    recording the acquisition time, the speaker identity information, the translation information and the source text information corresponding to the voice data to be simultaneously transmitted until the target voiceprint information is changed to obtain a recording segment information, and further obtaining the at least one recording segment information when the voice interview is finished.
  4. The method of claim 1, wherein the recording of the acquisition time, the speaker identity information, and the translation information corresponding to the voice data to be co-transmitted results in a recording segment information, and further wherein after at least one recording segment information is obtained at the end of a voice interview, the method further comprises:
    abstract extraction is carried out on the at least one recording segment information by adopting an abstract extraction technology, and full text abstract information is extracted;
    generating an interview record based on the at least one recording segment information, comprising:
    generating the interview record based on the at least one record fragment information and the full-text summary information.
  5. The method according to claim 1 or 4, wherein after recording the acquisition time, the speaker identity information, and the translation information corresponding to the voice data to be simultaneously interpreted to obtain a piece of recording segment information, the method further comprises:
    performing summary extraction on the piece of recording segment information using a summary extraction technique to extract speaker summary information;
    and obtaining the at least one piece of recording segment information and at least one piece of speaker summary information when the voice interview is finished.
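The "summary extraction technique" of claims 4 and 5 is not specified in the claims; a minimal extractive sketch, using word-frequency sentence scoring as an assumed stand-in (the function name `extract_summary` and the scoring scheme are illustrative, not from the patent), could look like:

```python
# Hypothetical extractive summarizer for the patent's "summary extraction
# technique": score each sentence by the total corpus frequency of its words
# and keep the top-scoring sentences in their original order.
import re
from collections import Counter

def extract_summary(text, max_sentences=1):
    # Split into sentences on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text.
    freq = Counter(re.findall(r"\w+", text.lower()))
    # Rank sentence indices by descending total word-frequency score.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -sum(freq[w] for w in re.findall(r"\w+", sentences[i].lower())),
    )
    # Emit the kept sentences in document order.
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

Applied per speaker segment this would yield the claim-5 speaker summary information, and applied over all segments the claim-4 full-text summary information.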
  6. The method of claim 5, wherein generating the interview record based on the at least one piece of recording segment information comprises:
    generating the interview record based on the at least one piece of recording segment information and the at least one piece of speaker summary information.
  7. The method of claim 1, wherein generating the interview record based on the at least one piece of recording segment information comprises:
    receiving an interview generation instruction sent by a control terminal;
    and generating the interview record from the at least one piece of recording segment information in timeline order in response to the interview generation instruction;
    wherein after generating the interview record based on the at least one piece of recording segment information, the method further comprises:
    sending the interview record to the control terminal.
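The hub-side flow of claims 1-7 (identify the speaker from the voiceprint via the preset mapping relationship, close a recording segment whenever the target voiceprint changes, then order the segments along the timeline) can be sketched as follows; the names `PRESET_MAPPING`, `Utterance`, `Segment`, and both functions are illustrative assumptions, not terms from the patent:

```python
# Hypothetical sketch of the hub device's segmentation and record generation.
from dataclasses import dataclass, field

# Preset mapping relationship: voiceprint -> (participant identity, target language).
PRESET_MAPPING = {
    "vp-alice": ("Alice", "en"),
    "vp-bob": ("Bob", "zh"),
}

@dataclass
class Utterance:
    acquisition_time: float  # time at which collection of this audio chunk started
    voiceprint: str          # target voiceprint extracted from the audio chunk
    translation: str         # real-time translation into the listener's language

@dataclass
class Segment:
    speaker: str
    start_time: float
    translations: list = field(default_factory=list)

def segment_by_voiceprint(utterances):
    """Accumulate one recording segment per run of identical voiceprints,
    closing the current segment whenever the target voiceprint changes."""
    segments, current = [], None
    for u in utterances:
        speaker, _lang = PRESET_MAPPING[u.voiceprint]
        if current is None or current.speaker != speaker:
            current = Segment(speaker=speaker, start_time=u.acquisition_time)
            segments.append(current)
        current.translations.append(u.translation)
    return segments

def generate_interview_record(segments):
    """Order the recording segment information along the timeline (claim 7)."""
    return sorted(segments, key=lambda s: s.start_time)
```

In a real deployment the voiceprint lookup would be a similarity match against enrolled voiceprint features rather than an exact dictionary key, but the segmentation-on-change logic is the same.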
  8. A method of processing speech information, comprising:
    receiving participant identity information, a target language, and voiceprint information of a participant;
    sending a preset mapping relationship formed by the participant identity information, the target language, and the voiceprint information of the participant to a hub device;
    when the interview is finished, receiving an interview trigger instruction, and generating an interview generation instruction in response to the interview trigger instruction;
    sending the interview generation instruction to the hub device;
    and receiving an interview record fed back by the hub device in response to the interview generation instruction, wherein the interview record is generated by the hub device, in response to the interview generation instruction, based on the preset mapping relationship and voice data to be simultaneously interpreted received in real time.
  9. The method of claim 8, wherein after receiving the interview record fed back by the hub device in response to the interview generation instruction, the method further comprises:
    displaying the interview record in timeline order.
  10. The method of claim 9, wherein each segment of the interview record comprises: speaker identity information, an acquisition time, voice data to be simultaneously interpreted, and translation information.
  11. The method of claim 10, wherein displaying the interview record in timeline order comprises:
    arranging each segment of the interview record according to its acquisition time;
    and displaying the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged segment.
  12. The method of claim 11, wherein each segment of the interview record further comprises: speaker summary information; and after displaying the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged segment, the method further comprises:
    displaying the speaker summary information in a first preset area within the display area of each segment.
  13. The method of claim 11, wherein the interview record further comprises: full-text summary information; and when displaying the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged segment, the method further comprises:
    displaying the full-text summary information ahead of the speaker identity information, the voice data to be simultaneously interpreted, and the translation information corresponding to each arranged segment.
  14. The method of claim 8, wherein after receiving the interview record fed back by the hub device in response to the interview generation instruction, the method further comprises:
    receiving an editing instruction;
    and editing the interview record in response to the editing instruction to obtain and display a final interview record.
  15. The method of claim 8, wherein after receiving the interview record fed back by the hub device in response to the interview generation instruction, the method further comprises:
    receiving an export instruction;
    processing the interview record into a preset format in response to the export instruction to obtain an export file;
    and sharing the export file.
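The control-terminal display of claims 9-13 (arrange segments by acquisition time, show speaker identity, source text, and translation, place the full-text summary ahead of the segments and each speaker summary in its segment's display area) can be sketched as below; the record field names and the `render_interview` helper are assumptions for this example, not the patent's terms:

```python
# Hypothetical renderer for the control terminal's timeline view of an
# interview record; each record is a dict standing in for one segment.
def render_interview(records, full_text_summary=None):
    lines = []
    # Claim 13: full-text summary is displayed ahead of the segments.
    if full_text_summary:
        lines.append(f"Summary: {full_text_summary}")
    # Claim 11: arrange segments by acquisition time, then show speaker
    # identity, source text, and translation for each.
    for rec in sorted(records, key=lambda r: r["acquisition_time"]):
        lines.append(f"[{rec['acquisition_time']}] {rec['speaker']}: "
                     f"{rec['source_text']} -> {rec['translation']}")
        # Claim 12: per-speaker summary in a preset area of the segment.
        if rec.get("speaker_summary"):
            lines.append(f"  ({rec['speaker_summary']})")
    return "\n".join(lines)
```

Exporting (claim 15) would then amount to writing this rendered text, or a richer preset format such as PDF, to the export file before sharing it.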
  16. A hub device, comprising:
    a first receiving unit, configured to receive, after a voice interview starts, voice data of a speaker to be simultaneously interpreted transmitted by an acquisition terminal, and to obtain an acquisition time at which collection of the voice data to be simultaneously interpreted starts;
    a determining unit, configured to determine speaker identity information based on the voice data to be simultaneously interpreted and a preset mapping relationship, wherein the preset mapping relationship is a correspondence among participant identity information, a target language, and voiceprint information of a participant;
    a translation unit, configured to translate the voice data to be simultaneously interpreted into a listener's target language in real time to obtain translation information, wherein a listener is a participant other than the speaker;
    a recording unit, configured to record the acquisition time, the speaker identity information, and the translation information corresponding to the voice data to be simultaneously interpreted to obtain a piece of recording segment information, and further to obtain at least one piece of recording segment information when the voice interview is finished;
    and a first generating unit, configured to generate an interview record based on the at least one piece of recording segment information.
  17. A control terminal, comprising:
    a second receiving unit, configured to receive participant identity information, a target language, and voiceprint information of a participant;
    a mapping unit, configured to send a preset mapping relationship formed by the participant identity information, the target language, and the voiceprint information of the participant to a hub device;
    the second receiving unit being further configured to receive an interview trigger instruction when the interview is finished;
    a second generating unit, configured to generate an interview generation instruction in response to the interview trigger instruction;
    a second sending unit, configured to send the interview generation instruction to the hub device;
    the second receiving unit being further configured to receive an interview record fed back by the hub device in response to the interview generation instruction, wherein the interview record is generated by the hub device, in response to the interview generation instruction, based on the preset mapping relationship and voice data to be simultaneously interpreted received in real time.
  18. A hub device, comprising:
    a first processor and a first memory;
    wherein the first processor is configured to execute a simultaneous interpretation program stored in the first memory to implement the voice information processing method according to any one of claims 1 to 7.
  19. A control terminal, comprising:
    a second processor and a second memory;
    wherein the second processor is configured to execute a simultaneous interpretation program stored in the second memory to implement the voice information processing method according to any one of claims 8 to 15.
  20. A storage medium having a simultaneous interpretation program stored thereon, wherein the simultaneous interpretation program, when executed by a first processor, implements the voice information processing method according to any one of claims 1 to 7; or, when executed by a second processor, implements the voice information processing method according to any one of claims 8 to 15.
CN201980101053.3A 2019-12-30 2019-12-30 Voice information processing method, center device, control terminal and storage medium Pending CN114503117A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/130075 WO2021134284A1 (en) 2019-12-30 2019-12-30 Voice information processing method, hub device, control terminal and storage medium

Publications (1)

Publication Number Publication Date
CN114503117A true CN114503117A (en) 2022-05-13

Family

ID=76686162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980101053.3A Pending CN114503117A (en) 2019-12-30 2019-12-30 Voice information processing method, center device, control terminal and storage medium

Country Status (2)

Country Link
CN (1) CN114503117A (en)
WO (1) WO2021134284A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116383432A (en) * 2023-04-20 2023-07-04 中关村科学城城市大脑股份有限公司 Audio data screening method and system



Also Published As

Publication number Publication date
WO2021134284A1 (en) 2021-07-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination