CN116416995A - Voice transcription method, electronic equipment and storage medium - Google Patents

Voice transcription method, electronic equipment and storage medium

Info

Publication number
CN116416995A
Authority
CN
China
Prior art keywords
text
audio data
audio
voice
video stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310409196.9A
Other languages
Chinese (zh)
Inventor
宋洪博
王艳龙
陈永波
储磊
沈峥嵘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202310409196.9A
Publication of CN116416995A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278 Subtitling
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/142 Constructional details of the terminal equipment, e.g. arrangements of the camera and the display
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems
    • H04N7/155 Conference systems involving storage of or access to video conference sessions

Abstract

The invention discloses a voice transcription method, an electronic device and a storage medium. The method comprises: acquiring real-time audio, sending the audio to a cloud for voice transcription, and receiving text corresponding to the audio data returned by the cloud, wherein the text comprises a near-end voice text and a far-end voice text; and acquiring a local video stream in real time based on the intelligent camera, and sending the local video stream and the text to a terminal, wherein the terminal can superimpose the text on the local video stream. In the embodiment of the invention, the local video stream is obtained through the intelligent camera and combined with the text corresponding to the real-time audio data to obtain video with real-time subtitles; voice transcription of the near-end audio and the far-end audio can be performed separately, and a conference record is generated automatically.

Description

Voice transcription method, electronic equipment and storage medium
Technical Field
The present invention relates to speech recognition technology, and in particular, to a speech transcription method, an electronic device, and a storage medium.
Background
The main equipment involved in the prior art includes a conference all-in-one machine, which integrates a camera, a microphone, a loudspeaker and other hardware together with an operating system on which conference software is installed. During a conference, the recording function of the conference software is turned on so that the content of the video conference can be recorded, and a conference record in which the speech has been converted into text is obtained after the conference. A traditional conference solution comprises hardware such as a pan-tilt camera, a gooseneck microphone (or an omnidirectional microphone), a conference host and a loudspeaker, with conference software running on the conference host or on a user PC to record the conference.
An existing conference all-in-one machine mainly captures the local recording through its own microphone and plays the downlink audio through its own loudspeaker, so the all-in-one machine has access to all of the uplink and downlink audio. Conference software running on the all-in-one machine can convert speech into text by uploading the uplink and downlink audio to a transcription service. In the traditional conference solution, the transcription function is realized by uploading the uplink and downlink audio through a computer participating in the conference.
The inventors found that existing conference equipment makes the speech-to-text operation inconvenient: it relies on conference software, or dedicated transcription software on a computer or other device, to realize the speech recording or speech-to-text functions. Furthermore, if real-time subtitles are to be added to the picture transmitted to the far end during a video conference, the far-end user receiving the video must add the subtitles with a related tool; without such a tool, the transcription and subtitle functions cannot be realized.
Disclosure of Invention
Embodiments of the present invention aim to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a voice transcription method for an intelligent camera, comprising: acquiring real-time audio, sending the audio to a cloud for voice transcription, and receiving text corresponding to the audio data returned by the cloud, wherein the text comprises a near-end voice text and a far-end voice text; and acquiring a local video stream in real time based on the intelligent camera, and sending the local video stream and the text to a terminal, wherein the terminal can superimpose the text on the local video stream.
In a second aspect, an embodiment of the present invention provides an electronic device, including: the speech transcription device comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the speech transcription methods of the present invention.
In a third aspect, embodiments of the present invention provide a storage medium having stored therein one or more programs including execution instructions that can be read and executed by an electronic device (including, but not limited to, a computer, a server, or a network device, etc.) for performing any one of the above-described voice transcription methods of the present invention.
In a fourth aspect, embodiments of the present invention also provide a computer program product comprising a computer program stored on a storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any one of the above-described speech transcription methods.
According to the embodiment of the invention, the local video stream is obtained through the intelligent camera and combined with the text corresponding to the real-time audio data to obtain video with real-time subtitles; voice transcription of the near-end audio and the far-end audio can be performed separately, and a conference record is generated automatically.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a speech transcription method according to an embodiment of the present invention;
FIG. 2 is a speech transcription flow chart of the speech transcription method of the present invention;
FIG. 3 is a flowchart illustrating a process for implementing a speech transcription method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an embodiment of an electronic device of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In the present invention, "module," "device," "system," and the like refer to a related entity applied to a computer, which may be hardware, a combination of hardware and software, or software in execution. In particular, an element may be, but is not limited to, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Both an application or script running on a server and the server itself may be an element. One or more elements may reside within a process and/or thread of execution, an element may be localized on one computer and/or distributed between two or more computers, and elements may be executed from various computer-readable media. Elements may also communicate by way of local and/or remote processes in accordance with a signal having one or more data packets, for example a signal from data interacting with another element in a local system or a distributed system, and/or interacting with other systems across a network such as the Internet by way of the signal.
Finally, it should also be noted that relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiment of the invention provides a voice transcription method which can be applied to electronic equipment. The electronic device may be a computer, a server, or other electronic products, etc., which the present invention is not limited to.
Referring to fig. 1, a voice transcription method for an intelligent camera according to an embodiment of the invention is shown.
As shown in fig. 1, in step 101, real-time audio is acquired, the audio is sent to a cloud for voice transcription, and text corresponding to the audio data returned by the cloud is received, wherein the text includes a near-end voice text and a far-end voice text;
in step 102, a local video stream is obtained in real time based on the intelligent camera, and the local video stream and the text are sent to a terminal, wherein the terminal can superimpose the text onto the local video stream.
In this embodiment, for step 101, the intelligent camera is connected to a microphone device through its USB host interface and acquires audio data in real time through that device; when the intelligent camera is started, the microphone device synchronously acquires audio. While the audio is being acquired, the audio data is sent to the cloud for voice transcription. The acquired audio comprises two types: the locally collected audio and the audio transmitted from the far end of the conference. The two types of audio are sent to the cloud for voice transcription as two separate streams, so that each stream can be recorded separately and the near-end and far-end speech in the conference can each be transcribed. After the voice transcription is completed, the texts corresponding to the two types of audio are received from the cloud: the locally collected audio corresponds to the near-end voice text, and the audio from the far end of the conference corresponds to the far-end voice text.
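To make the upload step concrete, the following sketch shows one way the two audio streams could be sent to a transcription service as separate sessions. It is a minimal illustration in Python, assuming a hypothetical WebSocket endpoint (TRANSCRIBE_URL), a hypothetical JSON message format and the third-party websockets library; the patent does not specify the actual cloud interface used by the camera.

```python
# Minimal sketch: upload near-end and far-end audio as two separate
# transcription sessions. TRANSCRIBE_URL, the JSON message format and the
# "channel" field are assumptions for illustration, not the real cloud API.
import asyncio
import json
import websockets

TRANSCRIBE_URL = "wss://transcribe.example.com/v1/stream"  # hypothetical endpoint

async def stream_channel(channel: str, audio_chunks, on_text):
    """Send one audio source (near-end or far-end) as its own session."""
    async with websockets.connect(TRANSCRIBE_URL) as ws:
        await ws.send(json.dumps({"type": "start", "channel": channel,
                                  "format": "pcm16", "rate": 16000}))
        async for chunk in audio_chunks:          # raw PCM bytes
            await ws.send(chunk)
            # Poll for any partial/final text the service has produced so far.
            try:
                msg = json.loads(await asyncio.wait_for(ws.recv(), timeout=0.01))
                on_text(channel, msg.get("text", ""))
            except asyncio.TimeoutError:
                pass
        await ws.send(json.dumps({"type": "end"}))

async def transcribe_conference(near_chunks, far_chunks, on_text):
    # Two independent streams -> near-end and far-end text stay separated.
    await asyncio.gather(
        stream_channel("near", near_chunks, on_text),
        stream_channel("far", far_chunks, on_text),
    )
```

Keeping the two sessions independent is what allows the service to return near-end and far-end text separately.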
Then, for step 102, real-time video of the conference site is captured by the intelligent camera, and the captured real-time video, the near-end voice text and the far-end voice text are sent to the PC, where the intelligent camera is connected to the PC through its USB (universal serial bus) slave interface so that it serves as the PC's camera and microphone peripheral. The intelligent camera sends the video, the near-end voice text and the far-end voice text acquired in real time to the PC, and the PC superimposes the near-end voice text and the far-end voice text onto the image frames of the video to obtain continuous image frames with text. For example, a text image is overlaid on each frame collected by the camera to synthesize a new frame; since the continuous image frames that are output are these synthesized images, the superimposition effect is achieved.
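As an illustration of the superimposition performed at the terminal, the sketch below burns the two caption lines into each video frame with OpenCV. It assumes the terminal receives frames as NumPy arrays and the latest near-end/far-end texts as plain strings; the band layout, fonts and colors are arbitrary choices and are not taken from the patent.

```python
# Minimal sketch of subtitle superimposition on the terminal (PC) side.
# Frame layout, fonts and colors are illustrative assumptions.
import cv2
import numpy as np

def overlay_captions(frame: np.ndarray, near_text: str, far_text: str) -> np.ndarray:
    """Return a copy of `frame` with near-end and far-end captions burned in."""
    out = frame.copy()
    h, w = out.shape[:2]
    band_h = 60
    # Darken a band at the bottom of the frame so the captions stay readable.
    band = np.zeros_like(out[h - band_h:h])
    out[h - band_h:h] = cv2.addWeighted(out[h - band_h:h], 0.4, band, 0.6, 0)
    cv2.putText(out, f"[Local]  {near_text}", (10, h - band_h + 24),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 1, cv2.LINE_AA)
    cv2.putText(out, f"[Remote] {far_text}", (10, h - band_h + 50),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (200, 255, 200), 1, cv2.LINE_AA)
    return out

# Usage: for each frame read from the camera, composite the latest texts
# before handing the frame to the conference software.
# frame_with_subs = overlay_captions(frame, latest_near_text, latest_far_text)
```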
According to the method, the local video stream is obtained through the intelligent camera and combined with the text corresponding to the real-time audio data to obtain video with real-time subtitles; voice transcription of the near-end audio and the far-end audio can be performed separately, and a conference record is generated automatically.
In some optional embodiments, first audio data collected locally and second audio data transmitted by at least one conference end are received; the first audio data is collected locally, the second audio data is the audio transmitted from the far end of the conference, and after the voice transcription is completed the first audio data corresponds to the near-end voice text and the second audio data corresponds to the far-end voice text. The intelligent camera is further provided with at least one network port, which is used to connect to the network and upload the acquired audio to the cloud for voice transcription. After acquiring the real-time audio, which comprises the first audio data and the second audio data, the intelligent camera, connected to a network cable through the network port, sends the first audio data and the second audio data to the cloud separately for voice transcription. For example, the local user opens a remote conference through Tencent Meeting, and the remote user joins through the Tencent Meeting client on a PC. The remote user's speech is recorded by the remote Tencent Meeting client and transmitted to the local Tencent Meeting client over the network. The local first audio is the audio picked up by the microphone speakerphone; the local second audio is the remote audio transmitted over the network.
In some optional embodiments, the intelligent camera may further receive the text data after the cloud transcription is completed; the intelligent camera receives the near-end voice text corresponding to the first audio data and the far-end voice text corresponding to the second audio data separately, again over the network cable connected to the network port. After receiving the text data transcribed by the cloud, the intelligent camera sends the near-end voice text and the far-end voice text in the text data, together with the local real-time video stream captured by the intelligent camera, to the terminal for processing. The terminal is not limited to a particular device and may be any intelligent device with processing capability, such as a PC, a tablet computer or a server. The processing consists of superimposing the near-end voice text and the far-end voice text onto the video stream, so that continuous image frames with text are obtained.
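For concreteness, the text data handed to the terminal alongside the video stream can be modeled as small caption messages. The structure below is purely illustrative; the field names and the queue-based hand-off are assumptions rather than part of the patent.

```python
# Illustrative model of what the camera forwards alongside the video stream.
# Field names and the queue-based hand-off are assumptions for the sketch.
from dataclasses import dataclass
from queue import Queue
import time

@dataclass
class CaptionMessage:
    channel: str      # "near" (local microphone) or "far" (remote conference audio)
    text: str         # transcribed text returned by the cloud
    timestamp: float  # capture time, used to align text with video frames

caption_queue: "Queue[CaptionMessage]" = Queue()

def on_transcription_result(channel: str, text: str) -> None:
    """Called whenever the cloud returns a piece of transcribed text."""
    caption_queue.put(CaptionMessage(channel, text, time.time()))

def latest_captions() -> dict:
    """Drain the queue and keep the most recent text per channel for overlay."""
    latest = {"near": "", "far": ""}
    while not caption_queue.empty():
        msg = caption_queue.get_nowait()
        latest[msg.channel] = msg.text
    return latest
```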
In some optional embodiments, after the intelligent camera sends the near-end voice text and the far-end voice text in the text data, together with the local real-time video stream it has captured, to the terminal and the terminal has processed them into continuous image frames with text, the video formed by these frames is sent to at least one conference end. The at least one conference end may be an online or an offline conference end, and the video formed by the continuous image frames with text can only be received after the user has logged in to the account corresponding to that conference end. For example, the intelligent camera sends the prepared subtitled video to the account that logged in to the conference end beforehand, and the user receives the subtitled video when entering the conference after logging in to the corresponding account. After logging in to the account corresponding to the conference end, the user can also view, download or edit the continuous image frames with text through the account corresponding to the at least one conference end. The user can bind a personal account, the conference text record generated after each conference is associated with the personal account, and the user can view, download and edit the conference text record through background software.
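The conference-record side can be pictured as accumulating transcript segments during the meeting and saving them under the bound account when the meeting ends. The sketch below is an assumed illustration of that bookkeeping; the on-disk layout and the account handling are not specified by the patent.

```python
# Minimal sketch of conference record generation bound to a personal account.
# The on-disk layout and the account identifier are illustrative assumptions.
import json
import time
from pathlib import Path

class ConferenceRecord:
    def __init__(self, account_id: str, records_dir: str = "./records"):
        self.account_id = account_id
        self.segments = []          # chronological transcript segments
        self.records_dir = Path(records_dir)

    def add_segment(self, channel: str, text: str) -> None:
        """Append one transcribed utterance (near-end or far-end)."""
        self.segments.append({"time": time.time(), "channel": channel, "text": text})

    def save(self) -> Path:
        """Persist the record so the user can view/download/edit it later."""
        self.records_dir.mkdir(parents=True, exist_ok=True)
        path = self.records_dir / f"{self.account_id}-{int(time.time())}.json"
        path.write_text(json.dumps(self.segments, ensure_ascii=False, indent=2),
                        encoding="utf-8")
        return path
```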
It should be noted that the intelligent camera in the present application is provided with at least one microphone, and the at least one microphone is connected to the intelligent camera through a USB interface. The intelligent camera supports an external microphone speakerphone device and can transmit the uplink and downlink audio to a PC.
Referring to fig. 2, a voice transcription flow chart of the voice transcription method of the present invention for an intelligent camera is shown.
As shown in fig. 2, the pan-tilt camera includes three interfaces: a USB slave port, a USB host port, and a network port.
The USB host port is used to connect an external pickup microphone speakerphone device and to transmit the uplink and downlink audio to the PC.
The USB slave port is connected to the PC, so that the device serves as the PC's camera, with the microphone speakerphone attached externally.
The network port is used for network access: the audio data is uploaded to the cloud transcription service, the speech is converted into text, and the text is sent back to the camera.
Referring to fig. 3, a flowchart of a process for implementing the voice transcription method of the present invention for an intelligent camera is shown.
As shown in fig. 3, voice transcription: the intelligent camera acts as the audio hub and can obtain both the uplink and the downlink audio, which correspond respectively to the locally collected audio and to the audio transmitted from the far end of the conference. The two audio streams are uploaded to the cloud transcription service separately, so the near-end and far-end speech in the conference can be transcribed and recorded separately.
Real-time subtitles: the intelligent camera collects the local image stream, uploads it to the PC and sends it to the far end through the conference software. Because the camera uploads the audio in real time, it obtains the transcribed text returned by the transcription service in real time; the text is superimposed onto the image frames and transmitted to the far end, so the continuous image frames carry text information and the video stream seen by the far-end user has real-time subtitles superimposed (a terminal-side loop illustrating this is sketched after this list).
Conference record generation: after the camera is unboxed, the user can bind a personal account; the conference text record generated after each conference is associated with the personal account, and the user can view, download and edit the conference text record through background software.
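Tying the earlier sketches to the real-time subtitle flow described above, a terminal-side loop could read a frame, fetch the latest transcribed text and output the composited frame. The loop below is illustrative only and reuses the hypothetical latest_captions and overlay_captions helpers from the earlier sketches; displaying the frame with cv2.imshow merely stands in for handing it to the conference software.

```python
# Illustrative main loop on the terminal side, tying the earlier sketches together:
# read a camera frame, fetch the latest transcribed text, burn in subtitles and
# output the composited frame (here: just display it).
import cv2

def run_subtitle_loop(camera_index: int = 0) -> None:
    cap = cv2.VideoCapture(camera_index)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            captions = latest_captions()                   # from the sketch above
            composited = overlay_captions(frame, captions["near"], captions["far"])
            cv2.imshow("conference preview", composited)   # stand-in for the far-end send
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    finally:
        cap.release()
        cv2.destroyAllWindows()
```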
It should be noted that, for simplicity of description, the foregoing method embodiments are all illustrated as a series of acts combined, but it should be understood and appreciated by those skilled in the art that the present invention is not limited by the order of acts, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention. In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In some embodiments, embodiments of the present invention provide a non-transitory computer readable storage medium having stored therein one or more programs including execution instructions that can be read and executed by an electronic device (including, but not limited to, a computer, a server, or a network device, etc.) for performing any of the above-described speech transcription methods of the present invention.
In some embodiments, embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-volatile computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any of the above-described speech transcription methods.
In some embodiments, the present invention further provides an electronic device, including: the system comprises at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a speech transcription method.
Fig. 4 is a schematic diagram of the hardware structure of an electronic device for performing a voice transcription method according to another embodiment of the present application. As shown in fig. 4, the device includes:
one or more processors 410 and a memory 420; one processor 410 is taken as an example in fig. 4.
The apparatus for performing the voice transcription method may further include: an input device 430 and an output device 440.
The processor 410, the memory 420, the input device 430 and the output device 440 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 4.
The memory 420 is used as a non-volatile computer readable storage medium for storing non-volatile software programs, non-volatile computer executable programs, and modules, such as program instructions/modules corresponding to the voice transcription method in the embodiments of the present application. The processor 410 executes various functional applications of the server and data processing, i.e., implements the above-described method embodiment voice transcription method, by running non-volatile software programs, instructions, and modules stored in the memory 420.
Memory 420 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to the use of the voice transcription apparatus, and the like. In addition, memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 420 may optionally include memory located remotely from processor 410, which may be connected to the voice transcription device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 430 may receive input numeric or character information and generate signals related to user settings and function control of the voice transcription apparatus. The output device 440 may include a display device such as a display screen.
The one or more modules are stored in the memory 420 that, when executed by the one or more processors 410, perform the speech transcription method of any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in a variety of forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication functionality and are aimed at providing voice, data communication. Such terminals include smart phones, multimedia phones, functional phones, low-end phones, and the like.
(2) Ultra mobile personal computer equipment, which belongs to the category of personal computers, has the functions of calculation and processing and generally has the characteristic of mobile internet surfing. Such terminals include PDA, MID, and UMPC devices, etc.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio players, video players, handheld game consoles, e-book readers, smart toys and portable in-vehicle navigation devices.
(4) Other on-board electronic devices with data interaction functions, such as on-board devices mounted on vehicles.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general purpose hardware platform, or may be implemented by hardware. Based on such understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the related art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A voice transcription method for an intelligent camera, comprising:
acquiring real-time audio, sending the audio to a cloud for voice transcription, and receiving a text corresponding to the audio data returned by the cloud, wherein the text comprises a near-end voice text and a far-end voice text;
and acquiring a local video stream in real time based on the intelligent camera, and sending the local video stream and the text to a terminal, wherein the terminal can superimpose the text on the local video stream.
2. The method of claim 1, wherein the acquiring real-time audio comprises:
receiving first audio data acquired locally and second audio data transmitted by at least one conference end, wherein the real-time audio comprises the first audio data and the second audio data, and the intelligent camera is provided with at least one network port used for respectively transmitting the first audio data and the second audio data.
3. The method of claim 2, wherein the sending the audio to the cloud for voice transcription comprises:
and respectively transmitting the first audio data and the second audio data to the cloud end through the at least one network port, wherein the cloud end can respectively carry out voice transcription on the first audio data and the second audio data.
4. The method of claim 2, wherein the receiving text corresponding to the audio data returned by the cloud comprises:
and receiving the near-end voice text corresponding to the first audio data and the far-end voice text corresponding to the second audio data which are completed by the cloud transcription.
5. The method of claim 2, wherein the sending the local video stream and the text to a terminal comprises:
and sending the near-end voice text and the far-end voice text to the terminal by combining the local video stream to carry out superposition processing, so as to obtain continuous image frames with characters.
6. The method of claim 5, wherein after the sending the local video stream and the text to a terminal, the method comprises:
and sending the continuous image frames with the characters to an account corresponding to the at least one conference end.
7. The method of claim 6, wherein the user is able to view, download or edit the continuous image frames with text via an account number corresponding to the at least one conference end.
8. The method of claim 1, wherein the intelligent camera is provided with at least one microphone, and the at least one microphone is connected to the intelligent camera through a USB interface.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 8.
10. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method according to any of claims 1 to 8.
CN202310409196.9A 2023-04-17 2023-04-17 Voice transcription method, electronic equipment and storage medium Pending CN116416995A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310409196.9A CN116416995A (en) 2023-04-17 2023-04-17 Voice transcription method, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310409196.9A CN116416995A (en) 2023-04-17 2023-04-17 Voice transcription method, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116416995A true CN116416995A (en) 2023-07-11

Family

ID=87052795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310409196.9A Pending CN116416995A (en) 2023-04-17 2023-04-17 Voice transcription method, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116416995A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination