WO2019015505A1 - Information processing method, system, electronic device, and computer storage medium - Google Patents

Information processing method, system, electronic device, and computer storage medium

Info

Publication number
WO2019015505A1
Authority
WO
WIPO (PCT)
Prior art keywords
text information
information
audio
text
audio information
Prior art date
Application number
PCT/CN2018/095081
Other languages
English (en)
French (fr)
Inventor
徐冲
李威
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Publication of WO2019015505A1
Priority to US16/742,753, published as US11664030B2

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0487 Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser
    • G06F3/0488 Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures
    • G06F3/04883 Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures for inputting data by handwriting, e.g. gesture or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/106 Display of layout of documents; Previewing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/189 Automatic justification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L2015/228 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of application context

Definitions

  • the present specification relates to the field of computer technology, and in particular, to an information processing method, system, electronic device, and computer storage medium.
  • Embodiments of the present specification may facilitate the generation of records.
  • an embodiment of the present disclosure provides an information processing method, the method comprising: receiving first text information input by a first input device, the first text information being generated according to a voice; receiving audio information recorded by a second input device, wherein the audio information is generated by recording the voice; performing voice recognition on the audio information to obtain second text information; and displaying the first text information and the second text information, wherein the content of the first text information and the second text information have a corresponding relationship.
  • the embodiment of the present specification further provides an information processing system, the information processing system comprising: an input device, an audio collection terminal, a display, and a processor. The input device is configured to receive first text information input by a user, the first text information being input by the user according to the voice; the audio collection terminal is configured to record audio information, wherein the audio information is generated by recording the voice; the processor is configured to perform voice recognition on the audio information to obtain second text information; the display is configured to display the first text information and the second text information, wherein the content of the first text information and the second text information have a corresponding relationship.
  • embodiments of the present specification further provide a computer storage medium storing computer program instructions that, when executed, implement: receiving first text information input by a first input device, the first text information being generated according to voice; receiving audio information recorded by a second input device, wherein the audio information is generated by recording the voice; performing voice recognition on the audio information to obtain second text information; and displaying the first text information and the second text information, wherein the content of the first text information and the second text information have a corresponding relationship.
  • the embodiment of the present specification further provides an information processing method, the method comprising: receiving first text information input by a first input device, the first text information being generated according to voice; receiving audio information recorded by a second input device, wherein the audio information is generated by recording the voice; sending the audio information, or characterization information of the audio information, to a server for voice recognition by the server; receiving second text information obtained by the voice recognition and fed back by the server; and displaying the first text information and the second text information, wherein the content of the first text information and the second text information have a corresponding relationship.
  • the embodiment of the present specification further provides an information processing system, the information processing system comprising: an input device, an audio collection terminal, a network communication unit, and a display. The input device is configured to receive first text information input by a user, the first text information being input by the user according to the voice; the audio collection terminal is configured to record audio information, wherein the audio information is generated by recording the voice; the network communication unit is configured to send the audio information, or characterization information of the audio information, to a server for voice recognition by the server, and to receive second text information obtained by the voice recognition and fed back by the server; the display is configured to display the first text information and the second text information, wherein the content of the first text information and the second text information have a corresponding relationship.
  • embodiments of the present specification further provide a computer storage medium storing computer program instructions that, when executed, implement: receiving first text information input by a first input device, the first text information being generated according to voice; receiving audio information recorded by a second input device, wherein the audio information is generated by recording the voice; sending the audio information, or characterization information of the audio information, to a server for voice recognition by the server; receiving second text information obtained by the voice recognition and fed back by the server; and displaying the first text information and the second text information, wherein the content of the first text information and the second text information have a corresponding relationship.
  • the embodiment of the present specification further provides an information processing method, the method comprising: receiving audio information, or characterization information of audio information, sent by a client; performing voice recognition on the audio information or the characterization information to obtain second text information; and sending the second text information to the client, so that the client displays the second text information together with the first text information of the client, wherein the content of the first text information and the second text information have a corresponding relationship.
  • the embodiment of the present specification further provides an electronic device, the electronic device comprising: a network communication unit and a processor. The network communication unit is configured to receive audio information, or characterization information of audio information, sent by a client, and to send the second text information provided by the processor to the client, so that the client displays the second text information together with the first text information of the client, wherein the content of the first text information and the second text information have a corresponding relationship; the processor is configured to perform voice recognition on the audio information or the characterization information to obtain the second text information.
  • the embodiment of the present specification further provides a computer storage medium, wherein the computer storage medium stores computer program instructions that, when executed, implement: receiving audio information, or characterization information of audio information, sent by a client; performing voice recognition on the audio information or the characterization information to obtain second text information; and sending the second text information to the client, so that the client displays the second text information together with the first text information of the client, wherein the content of the first text information and the second text information have a corresponding relationship.
  • the embodiment of the present specification further provides an information processing method, comprising: receiving first text information, the first text information being generated according to voice; receiving audio information, wherein the audio information is generated by recording the voice; performing voice recognition on the audio information to obtain second text information; and performing typesetting according to a correspondence between the first text information and the second text information, the typeset first text information and second text information being used for display.
  • an embodiment of the present disclosure further provides an electronic device, the electronic device comprising a processor configured to: receive first text information, the first text information being generated according to a voice; receive audio information, wherein the audio information is generated by recording the voice; perform voice recognition on the audio information to obtain second text information; and perform typesetting according to a correspondence between the first text information and the second text information, the typeset first text information and second text information being used for presentation.
  • the embodiment of the present specification further provides a computer storage medium, where the computer storage medium stores computer program instructions that, when executed, implement: receiving first text information, the first text information being generated according to voice; receiving audio information, wherein the audio information is generated by recording the voice; performing voice recognition on the audio information to obtain second text information; and performing typesetting according to a correspondence between the first text information and the second text information, the typeset first text information and second text information being used for presentation.
  • the embodiment of the present specification further provides an information processing method, the method comprising: receiving first text information input by a first input device, the first text information being generated according to voice; receiving audio information recorded by a second input device, wherein the audio information is generated by recording the voice; recognizing the audio information to obtain second text information; displaying the first text information in a first area; and displaying the second text information in a second area, wherein the content of the first text information and the second text information have a corresponding relationship, and the first area and the second area are located in the same interface.
  • In this way, for the same voice scene, the first text information is obtained by manual input while the second text information is obtained by speech recognition of the recorded audio.
  • The second text information, which comprehensively records the voice content, can then be used to revise the first text information.
  • As a result, the first text information can be more comprehensive and accurate, while still emphasizing key points and summarizing the recorded voice.
  • The first text information can occupy less space than the second text information, with the key points relatively prominent, which saves the reader's reading time.
  • FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure.
  • FIG. 2 is a schematic flowchart diagram of an information processing method according to an embodiment of the present disclosure.
  • FIG. 3 is a schematic diagram of an interface provided by an embodiment of the present specification.
  • FIG. 4 is a schematic diagram of an interface provided by an embodiment of the present specification.
  • FIG. 5 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of an interface provided by an embodiment of the present specification.
  • FIG. 7 is a schematic diagram of an interface in an embodiment of the present specification.
  • FIG. 8 is a schematic flowchart diagram of an information processing method according to an embodiment of the present disclosure.
  • FIG. 9 is a schematic flowchart diagram of an information processing method according to an embodiment of the present disclosure.
  • FIG. 10 is a schematic flowchart diagram of an information processing method according to an embodiment of the present disclosure.
  • FIG. 11 is a schematic flowchart diagram of an information processing method according to an embodiment of the present disclosure.
  • FIG. 12 is a schematic diagram of a module of an electronic device according to an embodiment of the present disclosure.
  • FIG. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
  • FIG. 14 is a schematic flowchart diagram of an information processing method according to an embodiment of the present disclosure.
  • FIG. 15 is a schematic diagram of a module of an electronic device according to an embodiment of the present disclosure.
  • FIG. 16 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
  • FIG. 17 is a schematic flowchart diagram of an information processing method according to an embodiment of the present disclosure.
  • FIG. 18 is a schematic diagram of a module of an electronic device according to an embodiment of the present disclosure.
  • FIG. 19 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
  • FIG. 20 is a schematic diagram of an information processing system according to an embodiment of the present specification.
  • FIG. 21 is a schematic flowchart diagram of an information processing method according to an embodiment of the present disclosure.
  • FIG. 22 is a schematic flowchart diagram of an information processing method according to an embodiment of the present disclosure.
  • the information processing system may include an audio information collection subsystem, a shorthand subsystem, a voice recognition subsystem, a typesetting subsystem, and a text correction subsystem.
  • the audio information collection subsystem can receive the audio information provided by the audio collection terminal, and provide the audio information to the voice recognition subsystem.
  • the audio information collection subsystem may perform preliminary screening on the audio information to avoid transmitting a plurality of audio information recording the same voice to the voice recognition subsystem when there are multiple audio collection terminals.
  • the audio information collection subsystem can perform waveform comparison on the audio information provided by the plurality of audio collection terminals to determine whether two or more audio collection terminals have collected the same voice. When the waveforms of the audio information provided by two or more audio collection terminals tend to be the same, the audio information with the strongest waveform energy can be provided to the voice recognition subsystem, as sketched below.
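  • For illustration only, a minimal sketch of this screening step follows, assuming waveforms arrive as equal-length NumPy arrays and using a normalized-correlation threshold of 0.9 (both assumptions, not values from the specification):

```python
import numpy as np

def pick_strongest_duplicates(signals, similarity_threshold=0.9):
    """Drop near-identical recordings of the same voice, keeping the one
    with the strongest waveform energy (hypothetical parameter values)."""
    keep = set(range(len(signals)))
    for i in range(len(signals)):
        for j in range(i + 1, len(signals)):
            if i not in keep or j not in keep:
                continue
            a, b = signals[i], signals[j]
            # Normalized correlation at zero lag; a real system might also
            # search over small lags to absorb clock offsets between terminals.
            corr = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
            if corr > similarity_threshold:
                # Same voice captured twice: discard the weaker recording.
                keep.discard(i if np.sum(a**2) < np.sum(b**2) else j)
    return [signals[k] for k in sorted(keep)]
```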
  • the audio information collection subsystem may set a label name corresponding to each audio collection terminal.
  • the identifier name of the audio collection terminal that collects the audio information is correspondingly transmitted.
  • the speech recognition subsystem can associate the identified content with the tag name.
  • an audio collection terminal is placed in a seat for use by a user located at the seat. Therefore, there is a correspondence between the label name and the user.
  • the audio collection subsystem can record the reception time of the received audio information.
  • the reception time can be sent to the voice recognition subsystem.
  • the system is not limited to sending the reception time together with the audio information to the speech recognition subsystem; the generation time of the audio information may be sent together with the audio information instead.
  • the audio collection subsystem can continuously treat the entire content of one recording session as one piece of audio information. Alternatively, the audio collection subsystem may divide one recording session into a plurality of pieces of audio information. For example, the audio information may be divided according to recording duration: every 20 milliseconds of recording forms one piece of audio information. Of course, the duration is not limited to 20 milliseconds; the specific duration may be selected from 20 milliseconds to 500 milliseconds. Alternatively, the audio information may be divided according to the amount of data; for example, each piece of audio information can be up to 5 MB.
  • alternatively, the audio information may be divided according to the continuity of the sound waveform: when a silent portion lasting a certain period of time exists between two adjacent continuous waveforms, each continuous sound waveform in the audio information is divided into a separate piece of audio information.
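  • As a hedged sketch of silence-based division, assuming 20 ms frames, an illustrative energy threshold, and a 300 ms minimum gap (none of these values come from the specification):

```python
import numpy as np

def split_on_silence(signal, sr, frame_ms=20, energy_thresh=1e-4, min_gap_ms=300):
    """Split a 1-D waveform into pieces of audio information separated by
    silent gaps (all thresholds are illustrative assumptions)."""
    frame = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame
    energies = np.array([np.mean(signal[i*frame:(i+1)*frame] ** 2)
                         for i in range(n_frames)])
    voiced = energies > energy_thresh
    min_gap = max(1, min_gap_ms // frame_ms)
    segments, start, gap = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:  # silence long enough: close the segment
                segments.append(signal[start*frame:(i - gap + 1)*frame])
                start, gap = None, 0
    if start is not None:
        segments.append(signal[start*frame:])
    return segments
```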
  • the shorthand subsystem is used to receive the input of the recording personnel, that is, the first text information that the recording personnel, under their own control, record for the content of the voice they hear.
  • the shorthand subsystem can receive the first text information input by the recording personnel operating the input device.
  • the shorthand subsystem can provide first text information input by the recorder to the typesetting subsystem for use in the typesetting subsystem for typesetting.
  • the shorthand subsystem can record the time information of the first text information.
  • the time information may be the time when the first text information is input, or the time when the first text information is completed.
  • the shorthand subsystem can provide the time information along with the first text information to the typesetting subsystem.
  • the first text information may have the name of the corresponding speaker. This allows the first textual information to more intuitively represent the identity and content of the speaker. Specifically, for example, the first text information may be “Xiao Ming said: ‘Xiao Zhang owes me 100 yuan...’”.
  • the speech recognition subsystem may perform speech recognition for the audio information to derive second text information representing the speech in the audio information.
  • the speech recognition subsystem may process the audio information according to a preset algorithm and output a feature matrix characterizing the audio data in the audio information.
  • the user's voice has the user's own characteristics, such as tone, intonation, speech rate, and so on.
  • each user's own sound characteristics can be reflected from the frequency, amplitude, and the like in the audio data.
  • thus, the feature matrix generated from the audio information according to the preset algorithm includes the characteristics of the audio data in the audio information.
  • the speech feature vector generated based on the feature matrix can be used to characterize the audio information and the audio data.
  • the preset algorithm may be MFCC (Mel Frequency Cepstrum Coefficient), MFSC (Mel Frequency Spectral Coefficient), FMFCC (Fractional Mel Frequency Cepstrum Coefficient), DMFCC (Discriminative MFCC), LPCC (Linear Prediction Cepstrum Coefficient), or the like; a feature-extraction sketch follows the glossary below.
  • MFCC Mel Frequency Cepstrum Coefficient
  • MFSC Mel Frequency Spectral Coefficient
  • FMFCC Fractional Mel Frequency Cepstrum Coefficient
  • DMFCC Discriminative MFCC
  • LPCC Linear Prediction Cepstrum Coefficient
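  • As an illustration of the feature-matrix step, a minimal sketch using MFCC (one of the algorithms named above) via the librosa library, which is an assumed dependency; the 16 kHz sample rate and 13 coefficients are common but assumed choices:

```python
import librosa  # assumed dependency; any MFCC implementation would do

def audio_to_feature_matrix(path, n_mfcc=13):
    """Compute an MFCC feature matrix for one piece of audio information."""
    y, sr = librosa.load(path, sr=16000)  # mono waveform at 16 kHz
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # shape: (n_mfcc, n_frames) -- one column of features per audio frame
```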
  • an endpoint detection process may also be included. In this way, data corresponding to non-user voice can be reduced in the feature matrix, which can improve, to some extent, the degree of association between the generated voice feature vector and the user.
  • the method of endpoint detection processing may include, but is not limited to, energy-based endpoint detection, cepstrum-feature-based endpoint detection, information-entropy-based endpoint detection, endpoint detection based on auto-correlation similarity distance, and the like, which are not enumerated here; an energy-based sketch follows.
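  • A minimal sketch of the first listed method, energy-based endpoint detection, assuming 25 ms frames and a threshold at 10% of peak frame energy (illustrative values):

```python
import numpy as np

def energy_endpoints(signal, sr, frame_ms=25, thresh_ratio=0.1):
    """Trim a waveform to the span between the first and last frames whose
    short-time energy exceeds a fraction of the peak energy."""
    frame = int(sr * frame_ms / 1000)
    n = len(signal) // frame
    energy = np.array([np.sum(signal[i*frame:(i+1)*frame] ** 2) for i in range(n)])
    active = np.where(energy > thresh_ratio * energy.max())[0]
    if active.size == 0:
        return None  # no speech detected
    return signal[active[0]*frame:(active[-1] + 1)*frame]
```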
  • the speech recognition subsystem may process the feature matrix using a speech recognition algorithm to obtain second text information expressed in the audio information.
  • the speech recognition algorithm may perform speech recognition on the audio information by using a hidden Markov algorithm or a neural network algorithm.
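  • The specification does not fix a particular recognizer; as a hedged stand-in, a sketch using the off-the-shelf speech_recognition package (an assumed dependency) to obtain the second text information from a recorded file:

```python
import speech_recognition as sr  # assumed dependency

def transcribe(path, language="zh-CN"):
    """Recognize the speech in one audio file and return the text; the
    recognizer choice and language code are illustrative assumptions."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file
    return recognizer.recognize_google(audio, language=language)
```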
  • the identified second text information may be provided to the typesetting subsystem.
  • so that the typesetting subsystem can typeset the second text information in correspondence with the first text information.
  • the speech recognition subsystem receives the time information and/or tag names corresponding to the audio information provided by the audio collection subsystem.
  • the speech recognition subsystem can provide the time information and/or tag names along with the second text information to the typesetting subsystem.
  • the typesetting subsystem may typeset the received first text information and the second text information.
  • so that first text information and second text information having a corresponding relationship can be displayed in correspondence with each other.
  • corresponding relationships include, but are not limited to: for audio information and first text information generated at approximately the same time, the second text information of that audio information is displayed at a position close to the first text information; or the second text information of that audio information shares the same display style as the first text information; or the second text information of that audio information carries a time tag close to that of the first text information.
  • the text correction subsystem may be configured to receive a modification of the first text information by the recorder to form a final text.
  • the first text information is text information that the recording person inputs quickly according to what is heard during the multi-person communication. Limited by the recorder's input speed and ability to follow the multi-person communication, the content of the first text information may be refined but not comprehensive enough; in some cases, important content may be missed, or some content may be inaccurate.
  • the second text information is obtained by performing voice recognition on the audio information of the multi-person communication, so the second text information tends to comprehensively record the content of the multi-person communication. However, the second text information may have the defects of being too long and insufficiently focused.
  • the recorder can modify the first text information by comparing the first text information with the corresponding second text information. In this way, while keeping its refined language and prominent key points, the first text information can be corrected for inaccurate expression and missing important content, making the first text information more complete.
  • Embodiments of the present specification provide an information processing system.
  • the information processing system may include an audio collection terminal and a client; or an audio collection terminal, a client, and a server.
  • the audio collection terminal may be configured to record the voice of the user to generate audio information.
  • the audio information is provided to an audio information collection subsystem.
  • the audio capture terminal can be a stand-alone product with a housing and a microphone and data communication unit mounted in the housing.
  • the audio collection terminal can be a microphone.
  • the microphone converts the sound signal into an electrical signal to obtain audio information.
  • the data communication unit can transmit the audio information to the processing subsystem.
  • the data communication unit may be a wired connection interface or a wireless communication module.
  • the data communication unit may be a Bluetooth module, a Wifi module, an audio interface, or the like.
  • the audio collection terminal can also be integrated into a client with certain data processing capabilities.
  • the client can include a smartphone, a tablet, a laptop, a personal digital assistant, and the like.
  • the client may primarily include hardware such as a processor, memory, display, and the like.
  • in this case the client can have strong data processing capabilities: after generating the feature matrix for the audio information, the client can perform endpoint detection processing, noise reduction processing, voice recognition, and the like.
  • the client can be a workstation, a well-configured desktop or laptop.
  • the client can run the aforementioned audio information collection subsystem, shorthand subsystem, voice recognition subsystem, typesetting subsystem, and text correction subsystem.
  • the client may primarily include hardware such as a network communication unit, a processor, a memory, a display, and the like.
  • the client can send audio information to the server through the network communication unit.
  • the client can also perform certain processing on the audio information, such as generating a feature matrix, and send the feature matrix to the server, so that the server can perform speech recognition and the like on the content of the audio information.
  • the client may include: a smartphone, a tablet, a desktop computer, a notebook computer, and the like.
  • the client can run the aforementioned audio information collection subsystem, shorthand subsystem, typesetting subsystem, and text correction subsystem.
  • the server may be an electronic device having a certain arithmetic processing capability. It may have a network communication unit, a processor, a memory, and the like. Of course, the above server may also refer to software running in the electronic device.
  • the above server may also be a distributed server, and may be a system with multiple processors, a memory, a network communication module, and the like. Alternatively, the server can also be a server cluster formed by several servers.
  • the server can run the aforementioned speech recognition subsystem.
  • the server can also run a typesetting subsystem to send the typeset data to the client.
  • the client used by the clerk can be a desktop computer to record the statements of the judge, plaintiff, and lawyer during the trial.
  • an audio collection terminal including a microphone is provided at the seats of the judge, the plaintiff, and the lawyer.
  • the audio collection terminal can communicate with the desktop computer through the Bluetooth communication technology, and input the recorded audio information to the desktop computer.
  • the clerk can use the keyboard to enter the first text message to the desktop computer.
  • the first text information may be content that the clerk hears from the speech of the judge, the plaintiff, or the lawyer during the trial, or may be content refined by the clerk from what the judge, the plaintiff, or the lawyer expressed.
  • the plaintiff Zhang San said: "I lent Li Si 5,000 yuan in April of the previous year. He said he would return it to me in two months. But after two months, when I went to him, he kept putting me off. Up to now I have gone to him four times, and he has shirked repayment for various reasons."
  • the first text information recorded by the clerk can be: "The plaintiff Zhang San: 'Lent Li Si 5,000 yuan in April 2015, to be returned in June 2015 as agreed. After the due date, Zhang San urged repayment repeatedly, but Li Si shirked and did not repay. Zhang San believes that Li Si has the ability to repay the debt.'"
  • the client will record the time when the clerk begins to enter the first text message, such as 10:22:16.
  • the audio collection terminal set at Zhang San's seat can record Zhang San's voice to form audio information and input the audio information to the client.
  • the client can record the time when the audio information is received at 10:22:20.
  • the client can send the audio information, the receiving time, and the tag name "Speaker 1" of the audio collecting terminal to the server.
  • the server performs speech recognition and identity recognition on the received audio information.
  • the second text information expressed in the audio information is obtained, that is: "I lent Li Si 5,000 yuan in April of the previous year. He said he would return it to me in two months. But after two months, when I went to him, he kept putting me off. Up to now I have gone to him four times, and he has shirked repayment for various reasons. In fact, I know he has money; he bought a car last month."
  • the server may correspond the second text information to the tag name "Speaker 1" of the audio collection terminal.
  • the second text information fed back to the client can be: "Speaker 1: I lent Li Si 5,000 yuan in April of the previous year..."
  • the client receives the second textual information fed back by the server.
  • the time of the audio information corresponding to the second text information is compared with the time of the first text information, and first text information and second text information that are close in time are displayed in correspondence.
  • the first text information and the second text information are respectively displayed in one window.
  • the first text information and the second text information are correspondingly displayed, and the corresponding first text information and the second text information may be displayed side by side in the horizontal direction.
  • the clerk uses the mouse to click the first text information or the second text information.
  • the client can play the corresponding audio information for generating the second text information.
  • the clerk can modify the first text information based on the second text information.
  • that is, the first text information reads: "The plaintiff Zhang San: 'Lent Li Si 5,000 yuan in April 2015, to be returned in June 2015 as agreed. After the due date, Zhang San urged repayment repeatedly, but Li Si shirked and did not repay. Zhang San believes that Li Si has the ability to repay the debt.'"
  • the clerk can modify the first text information to: "The plaintiff Zhang San: 'Lent Li Si 5,000 yuan in April 2015, to be returned in June 2015 as agreed. After the due date, Zhang San urged repayment four times, but Li Si shirked each time and did not repay. Zhang San believes that Li Si has the ability to repay the debt; moreover, Li Si bought a car in June 2017.'"
  • by displaying the first text information and the second text information in correspondence, the client makes it convenient for the clerk to modify the first text information.
  • the clerk can finally print the court record formed by the first text message.
  • the court record is more comprehensive and accurate, and the language is refined and focused.
  • the client used by the meeting record person can be a laptop.
  • the notebook integrates a microphone, and the microphone can record the speech of the participants in the conference as audio information.
  • the meeting recorder can use the keyboard of the laptop to enter the first text message.
  • the first text information may be the content of the meeting participant's speech heard by the meeting record person, or may be the content of the meeting record person's summary or discussion according to the meeting participant's speech or discussion.
  • Wang Wu said: "Our project is urgent in time. We must keep an eye on the progress. We have to complete the project acceptance before the end of the year, so we must hurry up. Can the R&D department deliver a complete design plan this month?" Ding Qi said: "We can't make it by the end of this month. We only have a preliminary design plan; we need to purchase materials for preliminary verification before we can give a relatively complete design plan."
  • the first text information recorded by the meeting recorder may include: "Wang Wu said: 'This project is more urgent in time, and each department needs to speed up the progress'", "Qian Ba said: 'The procurement of the raw materials needed by the R&D department will be completed this month'", and "Ding Qi said: 'Provided that raw material procurement is completed this month, the procurement department can provide a relatively complete design plan next month.'"
  • the client records audio information according to the speech of the participants in the conference.
  • the client can perform speech recognition and identification of the audio information.
  • the second text information expressed in the audio information and the identity information corresponding to the second text information are obtained.
  • the conference participants can first register at the client: the client records a participant's audio information, generates a user feature vector identifying that participant, and receives the participant's input identity information.
  • the voice feature vector of the audio information can be matched with the user feature vector to obtain the identity information corresponding to the second text information.
  • the second text information that the client recognizes from the audio information may include: "Wang Wu said: 'Our project is urgent in time. We must keep an eye on the progress. We have to finish the project acceptance before the end of the year. So everyone should hurry up. Can the R&D department give a complete design plan this month?'", "Ding Qi said: 'Not by the end of this month. We only have a preliminary design plan. We need to purchase materials for preliminary verification before we can give a relatively complete design plan. So when a complete design plan can be provided depends on when the procurement of raw materials can be completed.'", and "Qian Ba said: 'We can complete the procurement of raw materials at the end of this month.'"
  • the client may obtain the second text information corresponding to the first text information according to the manner of semantic analysis.
  • the meeting recorder can modify the first text information according to the content of the second text information. For example, "This project is more urgent in time, and all departments need to speed up the progress" is amended to "This project is more urgent in time; to complete project acceptance by the end of the year, all departments need to speed up the progress"; and "Provided that raw material procurement is completed this month, the procurement department can provide a relatively complete design plan next month" is revised to "Provided that raw material procurement is completed this month, the procurement department can provide a relatively complete design plan in the middle of next month."
  • the conference record formed by the first text information is emailed to the participating personnel.
  • Embodiments of the present specification provide an information processing method.
  • the information processing method can be applied to a client.
  • the information processing method may include the following steps.
  • Step S10 Receive first text information input by the first input device; the first text information is generated according to voice.
  • the first input device can be operated by the user to enable the client to receive the user's input.
  • the first input device may be a peripheral device connected to the client, such as a keyboard, a mouse, a tablet, or the like.
  • the first input device can also be a touch display that provides a user with a touch input function.
  • the first text information may be text that the user, according to his or her own will, composes based on the voice heard.
  • for example, a conference recorder types text information on a keyboard according to the communication content heard.
  • receiving the first text information: the first text information may be provided by the user to the client through the first input device. In this way, the first text information is generated by the user according to the user's own will.
  • that the first text information is generated according to the voice means it may be text that the user, under the control of the user's own will, inputs to the client according to the voice heard. Therefore, the first text information may include the user's refinement of the content of the heard voice, content as the user understands it, or a direct transcription of the content the user hears.
  • the first text information may be a sentence, or a natural segment.
  • the record formed by the recording personnel may include a plurality of pieces of first text information. Specifically, for example, the recorder may input one participant's speech as one piece of first text information; or the recorder may distill the viewpoints expressed by several participants into a sentence as one piece of first text information; or the recorder may record the content of one participant's speech divided into several natural paragraphs, each of which constitutes one piece of first text information.
  • where the functions and effects are the same as or similar to those of the present application, they should be covered by the present application.
  • Step S12 receiving audio information recorded by the second input device, wherein the audio information is generated according to the voice recording.
  • the second input device may be an audio collection terminal.
  • the second input device can be a microphone built into the client, or a microphone connected to the client as a peripheral.
  • a microphone is integrated in the notebook computer.
  • a desktop computer can be connected to an external microphone via a USB or Microphone interface.
  • step S10 and step S12 may be in no particular order.
  • both the plaintiff and the lawyer may have a microphone, and when the plaintiff or the lawyer speaks, the microphone can record the corresponding audio information.
  • the clerk inputs the first text information to the client through the keyboard according to the speech of the plaintiff or the lawyer that the clerk hears.
  • Step S14 Perform speech recognition on the audio information to obtain second text information.
  • the second text information may be content obtained by the client based on voice recognition.
  • the record formed by the second text information tends to fully record the conversational content of the speech in the audio information.
  • the second text information may be a sentence or a natural segment.
  • one piece of second text information may be generated from one piece of audio information; or each sentence of the recognized text may form one piece of second text information; or adjacent contexts expressing similar content may be combined into one piece of second text information; or, combined with identity recognition, the content of one speech by one user may be taken as one piece of second text information.
  • Step S16 Display the first text information and the second text information, wherein the content of the first text information and the second text information have a corresponding relationship.
  • the client can display the first text information and the second text information through the display, so that the user can conveniently consult them. Since the second text information is generated by voice recognition of the audio information, it records the content of the audio information more comprehensively, so the user can modify the content of the first text information by referring to the second text information. Specifically, for example, during or after the trial, the clerk can use the second text information, obtained by recognizing the plaintiff's or lawyer's audio information, to revise the first text information the clerk recorded, making the first text information more accurate and bringing convenience to the clerk.
  • the corresponding relationship may include: first text information and second text information generated at approximately the same time are displayed at positions close to each other; or they have the same display style; or they have time tags that are close to each other. Likewise, for audio information and first text information generated at approximately the same time, the second text information of that audio information may be displayed at a position close to the first text information, may share the same display style as the first text information, or may carry a time tag close to that of the first text information. In short, corresponding first text information and second text information are displayed in close proximity or with the same display style.
  • In this way, for the same voice scene, the first text information is obtained by manual input while the second text information is obtained by speech recognition of the recorded audio.
  • The second text information, which comprehensively records the voice content, can then be used to revise the first text information.
  • As a result, the first text information can be more comprehensive and accurate, while still emphasizing key points and summarizing the recorded voice.
  • The first text information can occupy less space than the second text information, with the key points relatively prominent, which saves the reader's reading time.
  • the following sub-steps may be included in the step of voice recognition.
  • Step S20 Identify identity information corresponding to the audio information.
  • Step S22 Identify second text information of the voice expression in the audio information; the second text information is associated with the identity information.
  • in the displaying step, the identity information may be displayed in correspondence with the corresponding second text information.
  • the feature matrix may be generated according to the audio information; further, the feature matrix is subjected to dimensionality reduction processing according to a plurality of feature dimensions to obtain a plurality of dimension values characterizing the feature dimensions, and the plurality of dimension values form the speech feature vector. A speech feature vector can be used to identify each user.
  • the feature matrix may be subjected to dimensionality reduction processing according to different feature dimensions to obtain a dimension value that can represent each feature dimension. Further, the dimension values are arranged in the specified order to form a speech representation vector of the audio information.
  • the feature matrix may be subjected to dimensionality reduction processing by a convolution or mapping algorithm.
  • DNN Deep Neural Network
  • CNN Convolutional Neural Network
  • RNN Recurrent Neural Network
  • deep learning or a combination of the above algorithms may be used to perform dimensionality reduction from different dimensions in the feature matrix.
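  • As a hedged sketch of this reduction, with simple statistics pooling standing in for the convolution, mapping, or deep-network (DNN/CNN/RNN) reduction named above:

```python
import numpy as np

def feature_matrix_to_vector(feature_matrix):
    """Reduce a (features x frames) matrix to a fixed-length speech feature
    vector by pooling each feature dimension over time (an illustrative
    stand-in for the specification's dimensionality-reduction algorithms)."""
    mean = feature_matrix.mean(axis=1)         # one dimension value per feature
    std = feature_matrix.std(axis=1)           # spread captures speaking style
    vec = np.concatenate([mean, std])
    return vec / (np.linalg.norm(vec) + 1e-9)  # normalize for later matching
```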
  • the collected audio information may be a recording of the user's speaking voice.
  • the speech representation vector generated from the audio information may correspond to the characterized audio information, and may also represent a portion of the user's sound characteristics. Because each user's growth and development process is different from each other, the sound of the user's speech has certain sound characteristics. Furthermore, different users can be distinguished by the voice characteristics of each user. As such, the speech characterization vector can be used to identify the user by characterizing a portion of the user's voice traits.
  • the audio information collected from the user may be one or more pieces, and a corresponding speech feature vector may be generated for each piece of audio information by the audio information processing method. Alternatively, more than one piece of audio information may be processed together according to the audio information processing method to obtain a single speech feature vector; in that case, the speech feature vector corresponds to more than one piece of audio information.
  • a user feature vector that can be used to identify the user can be determined from the obtained speech feature vectors. Specifically, for example, if only one speech feature vector is generated, that speech feature vector may be used as the user feature vector of the user; if a plurality of speech feature vectors are generated, the one among them that best expresses the user's voice characteristics may be selected as the user feature vector of the user; or part or all of the plurality of speech feature vectors may be further processed arithmetically to output the user feature vector of the user.
  • the arithmetic processing may include, but is not limited to, summing the corresponding dimensions for the plurality of speech feature vectors, and further calculating the mean.
  • summing the corresponding dimensions for the plurality of speech feature vectors may be included in the arithmetic processing.
  • weighted summation of multiple speech feature vectors may be included in the arithmetic processing.
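  • A minimal sketch of this arithmetic processing, combining several speech feature vectors of one user into a single user feature vector by plain or weighted dimension-wise averaging:

```python
import numpy as np

def user_feature_vector(speech_vectors, weights=None):
    """Combine one user's speech feature vectors into a user feature vector;
    the plain mean and the weighted sum mirror the options described above."""
    v = np.stack(speech_vectors)                   # shape: (n_vectors, dim)
    if weights is None:
        return v.mean(axis=0)                      # sum dimensions, take mean
    w = np.asarray(weights, dtype=float)
    return (v * w[:, None]).sum(axis=0) / w.sum()  # weighted summation
```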
  • the voice feature vector is matched with the user feature vector.
  • the match is successful, the personal information associated with the user feature vector is used as the identity information of the user.
  • the matching of the speech feature vector with the user feature vector may be performed by computing a relationship between the two; when the relationship meets a preset condition, the matching is considered successful.
  • for example, the differences between the two may be computed dimension by dimension and summed, and the obtained value used as a matching value, which is compared with a set threshold. If the matching value is less than or equal to the set threshold, the speech feature vector is considered successfully matched with the user feature vector.
  • alternatively, the speech feature vector may be directly summed with the user feature vector, and the obtained value used as a matching value. If the matching value is greater than or equal to a set threshold, the speech feature vector is considered successfully matched with the user feature vector.
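  • For illustration, a sketch of the matching step using cosine similarity with a 0.8 threshold; the similarity measure and threshold are assumptions standing in for the specification's generic matching value:

```python
import numpy as np

def match_user(voice_vec, user_vectors, threshold=0.8):
    """Match a speech feature vector against registered user feature vectors
    (dict of identity -> vector); return the identity or None."""
    best_name, best_score = None, -1.0
    for name, uvec in user_vectors.items():
        score = np.dot(voice_vec, uvec) / (
            np.linalg.norm(voice_vec) * np.linalg.norm(uvec) + 1e-9)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None  # None: no match
```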
  • in the step of identifying the identity information of the user, the personal information may be entered when the audio information of the user is first collected. In this way, a speech feature vector can be generated based on that audio information, and the entered personal information is associated with the speech feature vector.
  • this implementation may associate the second text information obtained by voice recognition with the identity information of the user who produced the voice, and then display the identity information in correspondence with the second text information. Specifically, the association may be corresponding storage, or one of the two may be determined from the other.
  • the personal information may be the user's name, nickname, role, or the like. The role can be, for example, plaintiff or lawyer.
  • the implementation may associate the second text information obtained by the voice recognition with the identifier name of the user who sent the voice, and then display the label name and the second text information correspondingly when displaying.
  • the tag name may be "user 1".
  • the identity information is displayed in correspondence with the second text information, so that when reading, the user can see the personnel involved in the record and what each person said.
  • the identity information may be displayed at the beginning or the end of the second text information.
  • the information processing method may further include the following steps.
  • Step S21 Send the audio information to a server, where the server determines identity information corresponding to the audio information.
  • Step S23 Receive identity information corresponding to the audio information fed back by the server.
  • the identity information is correspondingly displayed with the corresponding second text information.
  • the identification of the identity information corresponding to the audio information can be completed by the server.
  • the user can perform a registration operation on the server in advance, so that the server stores a user feature vector in association with the identity information of the corresponding user.
  • the server may generate a speech feature vector according to the received audio information and match the stored user feature vector to obtain a user feature vector corresponding to the speech feature vector, thereby obtaining identity information corresponding to the audio information.
  • the server may also have no pre-registered user feature vector, and the server may generate the voice feature vector of each audio information after receiving the audio information.
  • the plurality of speech feature vectors are clustered, and the speech feature vectors of the voices belonging to one user are aggregated in one data set, thereby obtaining a user feature vector that can represent the user.
  • the user's tag name can be assigned to the user feature vector according to the established naming rules.
  • Each user feature vector can correspond to a tag name.
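The unregistered case above — clustering per-utterance speech feature vectors and then naming the clusters — might be realized along the following lines. The choice of scikit-learn's AgglomerativeClustering and the distance threshold are assumptions; the specification does not name a clustering algorithm.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def assign_tag_names(speech_vectors, distance_threshold=15.0):
    """Cluster speech feature vectors so vectors of one speaker fall into one
    data set, derive a user feature vector per cluster, and name the clusters
    'user 1', 'user 2', ... according to a simple naming rule.

    speech_vectors: list of equal-length 1-D arrays (at least two of them).
    """
    X = np.stack(speech_vectors)
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    ).fit_predict(X)
    users = {}
    for cluster_id in sorted(set(labels)):
        members = X[labels == cluster_id]
        # The cluster mean serves as the user feature vector representing the user.
        users[f"user {cluster_id + 1}"] = members.mean(axis=0)
    return users
```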
  • In the step of displaying (or of typesetting) the first text information and the second text information, the method may include: matching first text information and audio information whose reception times are close to each other, so as to display the first text information in correspondence with the second text information recognized from that audio information.
  • When the client receives the first text information input by the user, the reception time can be recorded.
  • The reception time of the audio information can likewise be recorded when the generated audio information is received; this reception time can also be understood as the generation time.
  • That the first text information corresponds to the audio information means that the content they express tends to have the same semantics.
  • Specifically, first text information and audio information whose reception times are close have a greater likelihood of expressing the same semantics.
  • The first text information and the second text information are therefore displayed in correspondence, making it easy to compare and revise either of them.
  • For example, during a meeting the record keeper may hear a user speak and then record the content of the speech, generating the first text information.
  • Meanwhile, the microphone records the user's speech as audio information.
  • The time at which the record keeper inputs the first text information and the time at which the microphone records the speech can thus be fairly close, both being directed at one user's remarks in the meeting, so a correspondence exists between the first text information and the second text information.
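A minimal sketch of pairing records by reception time, under the assumption that both streams carry datetime stamps and that a tolerance of a few seconds separates unrelated entries:

```python
from datetime import datetime, timedelta

def pair_by_time(first_texts, recognized, tolerance=timedelta(seconds=10)):
    """Pair each manually typed entry (first text information) with the
    recognized entry (second text information) whose audio reception time
    is nearest to the typing time, within a tolerance.

    first_texts: list of (received_at: datetime, text: str)
    recognized:  list of (received_at: datetime, text: str)
    """
    pairs = []
    for t1, text1 in first_texts:
        candidates = [(abs(t2 - t1), text2) for t2, text2 in recognized]
        if not candidates:
            continue
        gap, text2 = min(candidates, key=lambda c: c[0])
        # Only entries close in time are presumed to share semantics.
        pairs.append((text1, text2 if gap <= tolerance else None))
    return pairs
```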
  • The manner of displaying the first text information and the second text information may include: displaying them at relatively close positions in one window, for example as two adjacent passages of text.
  • Alternatively, the first text information and the second text information may be displayed side by side in two windows, or given the same display style, such as font, size, color, or background color.
  • The method may also include: performing semantic matching of the first text information against the second text information of audio information generated within a specified time range, to obtain the second text information corresponding to the first text information.
  • The first text information is input by the user according to the speech heard, while the second text information is obtained by speech recognition of the audio information; both are therefore generated from the users' conversational speech in the same scene.
  • The first text information and the second text information thus have a degree of semantic consistency.
  • The first text information is semantically analyzed and matched within the second text information to find the second text information closest in semantics to the first text information.
  • The second text information found to be semantically closest is taken as the second text information corresponding to the first text information.
  • The manner of semantic matching may include: segmenting the first text information into the words it contains, and comparing those words with the second text information.
  • The second text information containing the most words of the first text information may be taken as the corresponding second text information; or the second text information containing the most synonyms of those words; or the second text information containing the most words and synonyms of the words combined.
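The word-overlap variant of semantic matching described above could look roughly like this. Whitespace tokenization stands in for a real word segmenter, and the synonym table is a placeholder assumption:

```python
def best_match(first_text, candidates, synonyms=None):
    """Pick the second text information that contains the most words of the
    first text information (counting synonyms too, if a table is given)."""
    synonyms = synonyms or {}
    words = set(first_text.split())  # stand-in for proper word segmentation
    expanded = words | {s for w in words for s in synonyms.get(w, ())}

    def score(candidate):
        # Substring containment doubles as a crude match for unsegmented text.
        return sum(1 for w in expanded if w in candidate)

    return max(candidates, key=score) if candidates else None
```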
  • To reduce the computational load, a time range may be specified, and the first text information matched only within the second text information of audio information generated inside that time range.
  • The time range may be close to the time at which the first text information was generated; it may include that time or not.
  • The semantic matching step may further include: taking the time at which the first text information is received as a reference time, and setting the specified time range according to the reference time, with the reference time lying within the specified time range.
  • The specified time range then includes the reception time of the first text information, so that the audio considered is close in time to the generation of the first text information; this speeds up the matching of the second text information corresponding to the first text information, saving time and reducing computation.
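Setting the specified time range around the reference time, as described, reduces to a simple window filter; the window widths here are assumed parameters:

```python
from datetime import timedelta

def candidates_in_window(reference_time, recognized,
                         before=timedelta(seconds=30),
                         after=timedelta(seconds=30)):
    """Restrict semantic matching to second text information whose audio was
    generated inside a window around the reference time (the reception time of
    the first text information), which itself lies within the window."""
    start, end = reference_time - before, reference_time + after
    return [text for t, text in recognized if start <= t <= end]
```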
  • In some embodiments, the information processing method may further include the following steps.
  • Step S26: Modify the first text information according to input from the first input device.
  • After the first and second text information are displayed, the user may revise the first text information according to the content of the second text information.
  • In this way, the first text information retains the qualities of a manual record, such as concise language and prominent key points, while gaining further accuracy and comprehensiveness.
  • This also brings convenience to the person producing the record.
  • The embodiment may further include step S28: outputting the modified first text information.
  • Outputting the modified first text information may mean sending it to a printing device, or sending it by e-mail.
  • The information processing method may further include: when a trigger event occurs on the first text information or the second text information, playing the audio information corresponding to the text on which the trigger event occurred.
  • The trigger event may be a click or swipe operation received by the client.
  • For example, the user clicks the first or second text information with a mouse, moves the mouse across it, or performs a touch tap on a touch display device.
  • A trigger event on the first or second text information may be an event in the area where that text information is displayed.
  • Alternatively, a play button may be provided alongside the first or second text information; when the play button is clicked, a trigger event is considered to have occurred on the corresponding text information.
  • The audio information corresponding to the second text information may be the audio information from which that second text information was recognized.
  • The information processing method may further include: when a trigger event occurs on the first text information, displaying the second text information corresponding to it in a specified style; or, when a trigger event occurs on the second text information, displaying the first text information corresponding to it in a specified style.
  • The specified style may include, but is not limited to, font, size, color, background color, bold effect, italic effect, underline, and the like.
  • Displaying the corresponding second text information in a specified style makes the correspondence between the two fairly intuitive and also convenient for the user to check against.
  • The specified style can differ from the style of the rest of the text, so that the display state of the corresponding text information is distinguished from the other characters, which is convenient for the user.
  • Likewise, when a trigger event occurs on the second text information, the first text information is displayed in the specified style.
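In a client interface, the trigger behaviour above amounts to a lookup from one text item to its counterpart plus a style change. The element and style model below is a framework-agnostic assumption, not a prescribed UI API:

```python
PAIRED = {}  # element id of a first text -> element id of its second text (and vice versa)
STYLES = {}  # element id -> current display style

def on_trigger(element_id, paired=PAIRED, styles=STYLES):
    """When a click or swipe triggers one text, render its counterpart in a
    specified style (here: bold on a highlight colour) so the correspondence
    stands out from the rest of the displayed text."""
    counterpart = paired.get(element_id)
    if counterpart is not None:
        styles[counterpart] = {"font-weight": "bold", "background": "#fff3b0"}
```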
  • Embodiments of the present specification also provide an information processing system, which may include the following modules.
  • A first receiving module configured to receive first text information input by a first input device, the first text information being generated according to speech.
  • A second receiving module configured to receive audio information recorded by a second input device, the audio information being generated by recording the speech.
  • A recognition module configured to perform speech recognition on the audio information to obtain second text information.
  • A display module configured to display the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
  • Embodiments of the present specification further provide an information processing system, which may include an input device, an audio collection terminal, a display, and a processor.
  • The input device is configured to receive first text information input by a user; the first text information is input by the user according to the speech.
  • The audio collection terminal is configured to record audio information, the audio information being generated by recording the speech.
  • The processor is configured to perform speech recognition on the audio information to obtain second text information.
  • The display is configured to display the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
  • The input device may be any device having an information input function.
  • For example, the input device may be a keyboard, a writing tablet, a stylus, or the like; it is not limited to the foregoing list.
  • The audio collection terminal includes a microphone.
  • The microphone may be an energy conversion device that converts a sound signal into an electrical signal.
  • The specific design of the microphone may include, but is not limited to, electrodynamic, capacitive, piezoelectric, electromagnetic, carbon, semiconductor, and other types.
  • The processor can be implemented in any suitable manner.
  • For example, the processor can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, an embedded microcontroller, and the like.
  • The display can provide an interface display.
  • By manufacturing material, displays may be divided into cathode ray tube displays, plasma display panels, liquid crystal displays, LED (light-emitting diode) panels, and so on.
  • The display is not limited to a flat display; it may also be a curved display, a stereoscopic display, or the like.
  • Embodiments of the present specification also provide a computer storage medium.
  • The computer storage medium stores computer program instructions that, when executed, implement: receiving first text information input by a first input device, the first text information being generated according to speech; receiving audio information recorded by a second input device, the audio information being generated by recording the speech; performing speech recognition on the audio information to obtain second text information; and displaying the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
  • The computer storage medium includes, but is not limited to, random access memory (RAM), read-only memory (ROM), a cache, a hard disk drive (HDD), or a memory card.
  • Embodiments of the present specification further provide an information processing method, which may include the following steps.
  • Step S30: Receive first text information input by a first input device; the first text information is generated according to speech.
  • Step S32: Receive audio information recorded by a second input device, the audio information being generated by recording the speech.
  • Step S34: Send the audio information, or characterization information of the audio information, to a server for speech recognition by the server.
  • Step S36: Receive second text information obtained by speech recognition and fed back by the server.
  • Step S38: Display the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
  • In this embodiment, the client can send the audio information to the server, and the server can perform speech recognition on it.
  • The server may send the second text information obtained by speech recognition back to the client.
  • The client can then display the first text information and the second text information.
  • For the server's processing of the audio information, refer to the other embodiments; details are not repeated here.
  • The client and the server can transmit data based on a network communication protocol, including but not limited to HTTP, TCP/IP, FTP, and the like.
  • The characterization information may be computed from the audio information and can be used to represent the audio information.
  • For example, the characterization information may be a feature matrix generated from the audio information, or the data resulting from endpoint detection processing of that feature matrix.
  • Having the server perform the speech recognition work reduces the computational load on the client, thereby lowering the hardware performance requirements for the client.
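A sketch of the client side of this exchange — computing characterization information (here an MFCC feature matrix) locally and posting it to the server — is shown below. The python_speech_features and requests packages, the endpoint URL, and the response schema are all assumptions:

```python
import requests
from python_speech_features import mfcc

def recognize_remotely(samples, sample_rate, url="https://example.com/asr"):
    """Send characterization information (an MFCC feature matrix) instead of
    raw audio, so that the server performs the speech recognition work."""
    feature_matrix = mfcc(samples, samplerate=sample_rate)  # frames x coefficients
    payload = {"features": feature_matrix.tolist(), "rate": sample_rate}
    response = requests.post(url, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()["second_text"]  # assumed response schema
```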
  • Embodiments of the present specification also provide an information processing system, which may include the following modules.
  • A first receiving module configured to receive first text information input by a first input device, the first text information being generated according to speech.
  • A second receiving module configured to receive audio information recorded by a second input device, the audio information being generated by recording the speech.
  • A sending module configured to send the audio information, or characterization information of the audio information, to a server for speech recognition by the server.
  • A third receiving module configured to receive second text information obtained by speech recognition and fed back by the server.
  • A display module configured to display the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
  • Embodiments of the present specification also provide an information processing system, which may include an input device, a sound collection terminal, a network communication unit, and a display.
  • The input device is configured to receive first text information input by a user; the first text information is input by the user according to the speech.
  • The sound collection terminal is configured to record audio information, the audio information being generated by recording the speech.
  • The network communication unit is configured to send the audio information, or characterization information of the audio information, to a server for speech recognition by the server, and to receive second text information obtained by speech recognition and fed back by the server.
  • The display is configured to display the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
  • The network communication unit may be an interface, set up according to the standard specified by a communication protocol, for network connection and communication; it can communicate in a wired and/or wireless manner.
  • Embodiments of the present specification also provide a computer storage medium storing computer program instructions.
  • The computer program instructions, when executed, implement: receiving first text information input by a first input device, the first text information being generated according to speech; receiving audio information recorded by a second input device, the audio information being generated by recording the speech; sending the audio information, or characterization information of the audio information, to a server for speech recognition by the server; receiving second text information obtained by speech recognition and fed back by the server; and displaying the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
  • The computer storage medium includes, but is not limited to, random access memory (RAM), read-only memory (ROM), a cache, a hard disk drive (HDD), or a memory card.
  • Embodiments of the present specification further provide an information processing method, which may include the following steps.
  • Step S40: Receive audio information, or characterization information of audio information, sent by a client.
  • Step S42: Perform speech recognition on the audio information or the characterization information to obtain second text information.
  • Step S44: Send the second text information to the client, so that the client displays the second text information together with the first text information of the first client; a correspondence exists between the content of the first text information and the second text information.
  • The server receives the audio information or characterization information sent by the client over the network.
  • An algorithm model for performing speech recognition on audio information or characterization information may be provided in the server.
  • One server may serve multiple clients, so the clients need not perform speech recognition themselves; the server performs it uniformly. This lowers the hardware performance requirements for the clients. Furthermore, having the server perform speech recognition uniformly facilitates maintaining, updating, and upgrading the speech recognition algorithms.
  • The step of performing speech recognition by the server may include the following steps.
  • Step S20: Identify identity information corresponding to the audio information.
  • Step S22: Recognize second text information expressed by the speech in the audio information; the second text information is associated with the identity information.
  • The server can identify the identity information corresponding to the audio information and associate the second text information belonging to one user with that user's identity information. After the server places the identity information into the second text information and provides the content to the client, users can read the content displayed by the client.
  • Embodiments of the present specification further provide a server, which may include the following modules.
  • A receiving module configured to receive audio information, or characterization information of audio information, sent by a client.
  • A recognition module configured to perform speech recognition on the audio information or the characterization information to obtain second text information.
  • A sending module configured to send the second text information to the client, so that the client displays the second text information together with the first text information of the first client; a correspondence exists between the content of the first text information and the second text information.
  • Embodiments of the present specification further provide an electronic device, which may include a network communication unit and a processor.
  • The network communication unit is configured to receive audio information, or characterization information of audio information, sent by a client, and to send the second text information provided by the processor to the client, so that the client displays the second text information together with the first text information of the first client; a correspondence exists between the content of the first text information and the second text information.
  • The processor is configured to perform speech recognition on the audio information or the characterization information to obtain the second text information.
  • Embodiments of the present specification further provide a computer storage medium storing computer program instructions that, when executed, implement: receiving audio information, or characterization information of audio information, sent by a client; performing speech recognition on the audio information or the characterization information to obtain second text information; and sending the second text information to the client, so that the client displays the second text information together with the first text information of the first client, where a correspondence exists between the content of the first text information and the second text information.
  • Embodiments of the present specification also provide an information processing method, which may include the following steps.
  • Step S50: Receive first text information; the first text information is generated according to speech.
  • Step S52: Receive audio information, the audio information being generated by recording the speech.
  • Step S54: Perform speech recognition on the audio information to obtain second text information.
  • Step S56: Typeset according to the correspondence between the first text information and the second text information; the typeset first text information and second text information are used for display.
  • The electronic device executing this information processing method can perform speech recognition on the audio information to obtain the second text information, and further typeset the first and second text information according to the correspondence between them.
  • For the specific manner of typesetting, refer to the description of the typesetting subsystem above, or to the other embodiments.
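As one possible reading of typesetting by correspondence, the sketch below interleaves each first text entry with its matched second text entry in time order; the tuple layout and labels are assumptions:

```python
def typeset(pairs):
    """Lay out corresponding entries as adjacent blocks, in time order.

    pairs: list of (received_at, first_text, second_text_or_None) tuples,
    e.g. as produced by a time- or semantics-based pairing step.
    """
    lines = []
    for _, first_text, second_text in sorted(pairs, key=lambda p: p[0]):
        lines.append(f"[record]      {first_text}")
        if second_text is not None:
            lines.append(f"[recognized]  {second_text}")
        lines.append("")  # blank line separates corresponding groups
    return "\n".join(lines)
```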
  • An embodiment of the present specification further provides an electronic device including a processor configured to: receive first text information, the first text information being generated according to speech; receive audio information, the audio information being generated by recording the speech; perform speech recognition on the audio information to obtain second text information; and typeset according to the correspondence between the first text information and the second text information, the typeset first and second text information being used for display.
  • An embodiment of the present specification further provides a computer storage medium storing computer program instructions that, when executed, implement: receiving first text information, the first text information being generated according to speech; receiving audio information, the audio information being generated by recording the speech; performing speech recognition on the audio information to obtain second text information; and typesetting according to the correspondence between the first text information and the second text information, the typeset first and second text information being used for display.
  • Embodiments of the present specification further provide an information processing method, which may include the following steps.
  • Step S60: Receive first text information input by a first input device; the first text information is generated according to speech.
  • Step S62: Receive audio information recorded by a second input device, the audio information being generated by recording the speech.
  • Step S64: Recognize the audio information to obtain second text information.
  • Step S66: Display the first text information in a first area.
  • Step S68: Display the second text information in a second area, where a correspondence exists between the content of the first text information and the second text information, and the first area and the second area are located in the same interface.
  • An interface presented on the display can be divided into at least a first area and a second area.
  • The first area is used to display the first text information, and the second area is used to display the second text information; dividing at least these two areas allows the first and second text information to be clearly distinguished, which is convenient for the reader.
  • The first text information may be content input by the record keeper according to his or her own understanding of the speech, so it can be more summarized, relatively concise in language, and relatively focused.
  • The second text information may be content obtained by speech recognition of the audio information, so it can be very comprehensive; compared with the first text information, however, it may be relatively long and less focused.
  • An improvement to a method flow can also be realized directly in hardware. A programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user programming the device, typically by writing code in a hardware description language (HDL).
  • Beyond pure computer-readable program code, a controller can be made to implement the same functions by logically programming the method steps in the form of logic gates, switches, ASICs, programmable logic controllers, embedded microcontrollers, and the like.
  • Such a controller can therefore be considered a hardware component, and the means included in it for implementing various functions can also be considered structures within the hardware component.
  • A means for implementing various functions can even be considered both a software module implementing a method and a structure within a hardware component.
  • The present specification can be implemented by software plus a necessary general hardware platform. Based on such an understanding, the technical solution of the present specification, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present specification or in parts of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Business, Economics & Management (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Artificial Intelligence (AREA)
  • Technology Law (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Embodiments of the present specification disclose an information processing method, system, electronic device, and computer storage medium. For the content of a multi-person conversation, the method displays the record keeper's record information in correspondence with the content obtained by speech recognition, so that the record keeper can refine the recorded content.

Description

Information processing method, system, electronic device, and computer storage medium
This application claims priority to Chinese Patent Application No. 201710592686.1, filed on July 19, 2017 and entitled "Information processing method, system, electronic device, and computer storage medium", the entire content of which is incorporated herein by reference.
Technical Field
The present specification relates to the field of computer technology, and in particular to an information processing method, system, electronic device, and computer storage medium.
Background
In real life, people communicate with one another to discuss matters. During such communication, a person is usually designated to record the matters discussed so as to form a record. For example, at work, a meeting of several people requires meeting minutes, and after a court holds a hearing, a trial record is formed.
Traditionally, records were mostly handwritten. Because handwriting is slow and records written on paper are difficult to edit afterwards, people have gradually turned to computers to produce records; for example, during a court hearing, the clerk operates a computer to produce the trial record.
Summary
The purpose of the embodiments of the present specification is to provide an information processing method, an electronic device, and a computer storage medium that can facilitate the production of records.
To achieve the above purpose, an embodiment of the present specification provides an information processing method, the method including: receiving first text information input by a first input device, the first text information being generated according to speech; receiving audio information recorded by a second input device, the audio information being generated by recording the speech; performing speech recognition on the audio information to obtain second text information; and displaying the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
An embodiment of the present specification further provides an information processing system including an input device, an audio collection terminal, a display, and a processor. The input device is configured to receive first text information input by a user, the first text information being input by the user according to speech. The audio collection terminal is configured to record audio information, the audio information being generated by recording the speech. The processor is configured to perform speech recognition on the audio information to obtain second text information. The display is configured to display the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
An embodiment of the present specification further provides a computer storage medium storing computer program instructions that, when executed, implement: receiving first text information input by a first input device, the first text information being generated according to speech; receiving audio information recorded by a second input device, the audio information being generated by recording the speech; performing speech recognition on the audio information to obtain second text information; and displaying the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
An embodiment of the present specification further provides an information processing method, the method including: receiving first text information input by a first input device, the first text information being generated according to speech; receiving audio information recorded by a second input device, the audio information being generated by recording the speech; sending the audio information, or characterization information of the audio information, to a server for the server to perform speech recognition; receiving second text information obtained by speech recognition and fed back by the server; and displaying the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
An embodiment of the present specification further provides an information processing system including an input device, an audio collection terminal, a network communication unit, and a display. The input device is configured to receive first text information input by a user, the first text information being input by the user according to speech. The audio collection terminal is configured to record audio information, the audio information being generated by recording the speech. The network communication unit is configured to send the audio information, or characterization information of the audio information, to a server for the server to perform speech recognition, and to receive second text information obtained by speech recognition and fed back by the server. The display is configured to display the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
An embodiment of the present specification further provides a computer storage medium storing computer program instructions that, when executed, implement: receiving first text information input by a first input device, the first text information being generated according to speech; receiving audio information recorded by a second input device, the audio information being generated by recording the speech; sending the audio information, or characterization information of the audio information, to a server for the server to perform speech recognition; receiving second text information obtained by speech recognition and fed back by the server; and displaying the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
An embodiment of the present specification further provides an information processing method, the method including: receiving audio information, or characterization information of audio information, sent by a client; performing speech recognition on the audio information or the characterization information to obtain second text information; and sending the second text information to the client, so that the client displays the second text information together with first text information of the first client, where a correspondence exists between the content of the first text information and the second text information.
An embodiment of the present specification further provides an electronic device including a network communication unit and a processor. The network communication unit is configured to receive audio information, or characterization information of audio information, sent by a client, and to send second text information provided by the processor to the client, so that the client displays the second text information together with first text information of the first client, where a correspondence exists between the content of the first text information and the second text information. The processor is configured to perform speech recognition on the audio information or the characterization information to obtain the second text information.
An embodiment of the present specification further provides a computer storage medium storing computer program instructions that, when executed, can implement: receiving audio information, or characterization information of audio information, sent by a client; performing speech recognition on the audio information or the characterization information to obtain second text information; and sending the second text information to the client, so that the client displays the second text information together with first text information of the first client, where a correspondence exists between the content of the first text information and the second text information.
An embodiment of the present specification further provides an information processing method including: receiving first text information, the first text information being generated according to speech; receiving audio information, the audio information being generated by recording the speech; performing speech recognition on the audio information to obtain second text information; and typesetting according to the correspondence between the first text information and the second text information, the typeset first text information and second text information being used for display.
An embodiment of the present specification further provides an electronic device including a processor configured to: receive first text information, the first text information being generated according to speech; receive audio information, the audio information being generated by recording the speech; perform speech recognition on the audio information to obtain second text information; and typeset according to the correspondence between the first text information and the second text information, the typeset first text information and second text information being used for display.
An embodiment of the present specification further provides a computer storage medium storing computer program instructions that, when executed, implement: receiving first text information, the first text information being generated according to speech; receiving audio information, the audio information being generated by recording the speech; performing speech recognition on the audio information to obtain second text information; and typesetting according to the correspondence between the first text information and the second text information, the typeset first text information and second text information being used for display.
An embodiment of the present specification further provides an information processing method, the method including: receiving first text information input by a first input device, the first text information being generated according to speech; receiving audio information recorded by a second input device, the audio information being generated by recording the speech; recognizing the audio information to obtain second text information; displaying the first text information in a first area; and displaying the second text information in a second area, where a correspondence exists between the content of the first text information and the second text information, and the first area and the second area are located in the same interface.
As can be seen from the technical solutions provided by the above embodiments of the present specification, for one and the same speech scene, the first text information and the second text information can be obtained respectively by manual input and by speech recognition of recorded audio. Because the second text information can record the speech content relatively comprehensively, it can be used to revise the first text information. In this way, the first text information can be relatively comprehensive and accurate, can highlight key points, and can summarize the content of the speech. Compared with the second text information, the first text information can be shorter and more focused, saving the reader's reading time.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present specification or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some of the embodiments recorded in the present specification; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present specification;
FIG. 2 is a schematic flowchart of an information processing method provided by an embodiment of the present specification;
FIG. 3 is a schematic diagram of an interface provided by an embodiment of the present specification;
FIG. 4 is a schematic diagram of an interface provided by an embodiment of the present specification;
FIG. 5 is a schematic diagram of an application scenario provided by an embodiment of the present specification;
FIG. 6 is a schematic diagram of an interface provided by an embodiment of the present specification;
FIG. 7 is a schematic diagram of an interface provided by an embodiment of the present specification;
FIG. 8 is a schematic flowchart of an information processing method provided by an embodiment of the present specification;
FIG. 9 is a schematic flowchart of an information processing method provided by an embodiment of the present specification;
FIG. 10 is a schematic flowchart of an information processing method provided by an embodiment of the present specification;
FIG. 11 is a schematic flowchart of an information processing method provided by an embodiment of the present specification;
FIG. 12 is a schematic module diagram of an electronic device provided by an embodiment of the present specification;
FIG. 13 is a schematic architecture diagram of an electronic device provided by an embodiment of the present specification;
FIG. 14 is a schematic flowchart of an information processing method provided by an embodiment of the present specification;
FIG. 15 is a schematic module diagram of an electronic device provided by an embodiment of the present specification;
FIG. 16 is a schematic architecture diagram of an electronic device provided by an embodiment of the present specification;
FIG. 17 is a schematic flowchart of an information processing method provided by an embodiment of the present specification;
FIG. 18 is a schematic module diagram of an electronic device provided by an embodiment of the present specification;
FIG. 19 is a schematic architecture diagram of an electronic device provided by an embodiment of the present specification;
FIG. 20 is a schematic diagram of an information processing system provided by an embodiment of the present specification;
FIG. 21 is a schematic flowchart of an information processing method provided by an embodiment of the present specification;
FIG. 22 is a schematic flowchart of an information processing method provided by an embodiment of the present specification.
Detailed Description
To enable those skilled in the art to better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification are described clearly and completely below with reference to the drawings of the embodiments. The described embodiments are obviously only some, not all, of the embodiments of the present specification. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without creative effort shall fall within the scope of protection of the present specification.
Referring to FIG. 20, an embodiment of the present specification provides an information processing system. The information processing system may include an audio information collection subsystem, a shorthand subsystem, a speech recognition subsystem, a typesetting subsystem, and a text correction subsystem.
In this embodiment, the audio information collection subsystem can receive audio information provided by audio collection terminals and provide it to the speech recognition subsystem. The audio information collection subsystem can pre-filter the audio information so that, when multiple audio collection terminals exist, multiple recordings of the same speech are not all sent to the speech recognition subsystem. It can compare the waveforms of the audio information provided by the multiple terminals to determine whether two or more terminals have captured substantially the same audio; when the waveforms from two or more terminals are found to be substantially the same, the audio information with the strongest waveform energy can be provided to the speech recognition subsystem.
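That pre-filtering step might be sketched as follows: recordings whose waveforms correlate strongly are treated as duplicates of the same speech, and only the most energetic copy is kept. The correlation threshold is an assumed parameter:

```python
import numpy as np

def select_strongest(recordings, similarity=0.9):
    """From synchronized recordings of the same time slice, drop near-duplicates
    (waveforms whose normalized correlation exceeds the threshold) and keep,
    within each duplicate group, the waveform with the highest energy."""
    keep = []
    for wave in recordings:
        wave = np.asarray(wave, dtype=float)
        duplicate_of = None
        for i, kept in enumerate(keep):
            n = min(len(wave), len(kept))
            a, b = wave[:n], kept[:n]
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            if denom > 0 and abs(np.dot(a, b)) / denom > similarity:
                duplicate_of = i
                break
        if duplicate_of is None:
            keep.append(wave)
        elif np.sum(wave ** 2) > np.sum(keep[duplicate_of] ** 2):
            keep[duplicate_of] = wave  # stronger copy of the same speech
    return keep
```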
In this embodiment, when multiple audio collection terminals are provided, the audio information collection subsystem can set a tag name for each terminal. When sending audio information to the speech recognition subsystem, the tag name of the terminal that captured the audio is sent along with it, so that the speech recognition subsystem can associate the recognized content with the tag name. Moreover, one audio collection terminal is usually placed at one seat for the user sitting there, so a correspondence exists between the tag name and the user.
In this embodiment, the audio collection subsystem can record the reception time of the received audio information. When sending the audio information to the speech recognition subsystem, the reception time can be sent along with it. Of course, instead of the reception time, the generation time of the audio information may also be sent together with it.
In this embodiment, during one recording session, the audio collection subsystem may treat the entire content of the session as one piece of audio information. Alternatively, it may divide one recording session into multiple pieces of audio information, for example by recording duration: every 20 milliseconds of recording may form one piece of audio information. The duration is of course not limited to 20 milliseconds and may be chosen from 20 to 500 milliseconds. Audio information may also be divided by data volume, for example at most 5 MB per piece; or according to the continuity of the sound waveform, for example when a silent period of a certain duration exists between two adjacent continuous waveforms, each continuous waveform in the audio is treated as one piece of audio information.
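The waveform-continuity rule — splitting wherever a sufficiently long silent stretch separates two continuous waveforms — could be approximated like this; the frame length, silence threshold, and minimum gap are assumptions:

```python
import numpy as np

def split_on_silence(samples, rate, frame_ms=20, silence_rms=0.01, min_gap_ms=300):
    """Divide a recording into pieces of audio information at silent gaps."""
    frame = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame
    rms = np.array([np.sqrt(np.mean(samples[i*frame:(i+1)*frame] ** 2))
                    for i in range(n_frames)])
    voiced = rms > silence_rms
    pieces, start, gap = [], None, 0
    min_gap = max(1, int(min_gap_ms / frame_ms))
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:  # silence long enough: close the current piece
                pieces.append(samples[start*frame:(i - gap + 1)*frame])
                start, gap = None, 0
    if start is not None:
        pieces.append(samples[start*frame:])
    return pieces
```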
In this embodiment, the shorthand subsystem is provided for the record keeper to input text: according to the speech heard, and under his or her own volition, the record keeper inputs first text information recording the content of the speech. The shorthand subsystem can receive the first text information input by the record keeper through an input device and provide it to the typesetting subsystem for typesetting.
In this embodiment, the shorthand subsystem can record time information of the first text information. The time information may be the time at which input of the first text information started, or the time at which input was completed. The shorthand subsystem can provide the time information together with the first text information to the typesetting subsystem.
In this embodiment, the first text information may carry the name of the corresponding speaker, so that it indicates the speaker's identity and content more intuitively. For example, the first text information may be "Xiao Ming said: 'Xiao Zhang owes me 100 yuan...'".
In this embodiment, the speech recognition subsystem can perform speech recognition on the audio information to obtain second text information representing the speech in the audio information.
In this embodiment, the speech recognition subsystem can collect data from the audio information according to a preset algorithm and output a feature matrix that includes the features of the audio data of the audio information. A user's voice carries the user's own characteristics, such as timbre, intonation, and speaking rate. When recorded as audio information, each user's own voice characteristics are reflected in the frequency, amplitude, and other aspects of the audio data, so that a feature matrix generated from the audio information according to the preset algorithm includes the features of that audio data. A speech feature vector generated on the basis of the feature matrix can therefore be used to characterize the audio information and the audio data. The preset algorithm may be MFCC (Mel Frequency Cepstral Coefficients), MFSC (Mel Frequency Spectral Coefficients), FMFCC (Fractional Mel Frequency Cepstral Coefficients), DMFCC (Discriminative MFCC), LPCC (Linear Prediction Cepstral Coefficients), or the like. Of course, those skilled in the art may, inspired by the technical essence of the present specification, adopt other algorithms to generate the feature matrix of the audio information; as long as the functions and effects achieved are the same as or similar to those of the present specification, they shall be covered by its scope of protection.
In this embodiment, in order to further distinguish the audio data of the user's speech from non-speech audio data in the audio information, the method of generating the speech feature vector may also include endpoint detection processing. The data corresponding to non-speech audio can then be reduced in the feature matrix, which can to some extent improve the degree of association between the generated speech feature vector and the user. Endpoint detection methods may include, without limitation, energy-based endpoint detection, endpoint detection based on cepstral features, endpoint detection based on information entropy, endpoint detection based on self-correlation similarity distance, and so on, which are not enumerated here.
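A minimal energy-based endpoint detector in the spirit of the first method listed, trimming a recording to the span between its first and last energetic frames; the frame length and energy ratio are assumed parameters:

```python
import numpy as np

def trim_endpoints(samples, rate, frame_ms=20, energy_ratio=0.1):
    """Energy-based endpoint detection: keep only the span between the first
    and last frames whose energy exceeds a fraction of the mean frame energy,
    so that non-speech audio contributes less to the feature matrix."""
    frame = int(rate * frame_ms / 1000)
    frames = [samples[i:i + frame] for i in range(0, len(samples) - frame + 1, frame)]
    energy = np.array([np.sum(np.square(f)) for f in frames])
    threshold = energy_ratio * energy.mean()
    active = np.nonzero(energy > threshold)[0]
    if active.size == 0:
        return samples[:0]  # no speech detected
    return samples[active[0] * frame:(active[-1] + 1) * frame]
```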
In this embodiment, the speech recognition subsystem can process the feature matrix with a speech recognition algorithm to obtain the second text information expressed in the audio information. For example, the speech recognition algorithm may use a hidden Markov model or a neural network algorithm to perform speech recognition on the audio information.
In this embodiment, after recognizing the second text information, the speech recognition subsystem can provide it to the typesetting subsystem, so that the typesetting subsystem lays out the second text information in correspondence with the first text information. In some cases, the speech recognition subsystem receives, from the audio collection subsystem, the time information and/or tag name corresponding to the audio information; it can then provide the time information and/or tag name to the typesetting subsystem together with the second text information.
In this embodiment, the typesetting subsystem can typeset the received first and second text information so that first and second text information having a correspondence are displayed together. The correspondence includes but is not limited to: for audio information and first text information generated at about the same time, the second text information of that audio is displayed at a position close to the first text information; or the second text information of that audio has the same display style as the first text information; or the second text information of that audio has a time label close to that of the first text information.
In this embodiment, the text correction subsystem can receive the record keeper's modifications to the first text information, thereby forming the final text. The first text information is text quickly input by the record keeper according to what was heard during a multi-person conversation. Limited by the record keeper's input speed and ability to follow the conversation, the first text information may be concise but possibly not comprehensive; some important content may be missed, or some content may be expressed inaccurately. The second text information is obtained by speech recognition of the audio of the conversation, so it tends to record the content of the conversation comprehensively, but it may be too long and insufficiently focused. After the typesetting subsystem lays out the first and second text information, the record keeper can compare the first text information with the corresponding second text information and modify the first text information. In this way, while remaining concise and focused, the first text information can be corrected for possible inaccuracies and omissions of important content, making it more complete.
An embodiment of the present specification provides an information processing system. The information processing system may include an audio collection terminal and a client; or an audio collection terminal, a client, and a server.
In this embodiment, the audio collection terminal can record a user's speech to generate audio information and provide the audio information to the audio information collection subsystem. The audio collection terminal may be an independent product with a housing in which a microphone and a data communication unit are mounted; for example, it may be a microphone unit. The microphone converts the sound signal into an electrical signal to obtain the audio information, and the data communication unit can send the audio information to the processing subsystem. The data communication unit may be a wired interface or a wireless communication module, for example a Bluetooth module, a WiFi module, or an audio interface. Of course, the audio collection terminal may also be integrated into a client with some data processing capability, such as a smartphone, tablet computer, notebook computer, or personal digital assistant.
In one embodiment, the client may mainly include hardware such as a processor, a memory, and a display, and may have relatively strong data processing capability. After generating a feature matrix for the audio information, the client can perform endpoint detection, noise reduction, speech recognition, and so on. For example, the client may be a workstation or a well-configured desktop or notebook computer.
In this embodiment, the client can run the aforementioned audio information collection subsystem, shorthand subsystem, speech recognition subsystem, typesetting subsystem, and text correction subsystem.
In one embodiment, the client may mainly include hardware such as a network communication unit, a processor, a memory, and a display. The client can send the audio information to the server through the network communication unit. The client may also process the audio information to some extent, for example generating a feature matrix and sending the feature matrix to the server, so that the server performs speech recognition on the content of the audio information. For example, the client may include a smartphone, tablet computer, desktop computer, notebook computer, and the like.
In this embodiment, the client can run the aforementioned audio information collection subsystem, shorthand subsystem, typesetting subsystem, and text correction subsystem.
Of course, the above merely lists some clients by way of example. With the progress of science and technology, the performance of hardware devices may improve, so that electronic devices with currently weak data processing capability may also come to have good processing capability.
In this embodiment, the server may be an electronic device with certain computational processing capability, having a network communication terminal, a processor, a memory, and so on. The server may also refer to software running on that electronic device. The server may further be a distributed server, i.e., a system with multiple processors, memories, and network communication modules operating in coordination, or a server cluster formed of several servers.
In this embodiment, the server can run the aforementioned speech recognition subsystem. The server may also run the typesetting subsystem and send the typeset data to the client.
Referring to FIG. 1 and FIG. 2 together: in one example scenario of a court hearing, the client used by the clerk may be a desktop computer, used to record the statements of the judge, the plaintiff, and the defendant during the hearing. Audio collection terminals including microphones are provided at the seats of the judge, the plaintiff, and the defendant. The audio collection terminals can communicate with the desktop computer via Bluetooth and feed the recorded audio information to it.
Referring to FIG. 3: in this example scenario, the clerk can use the keyboard to input first text information into the desktop computer. The first text information may be what the clerk hears the judge, plaintiff, or defendant say during the hearing, or content the clerk distills from their statements. For example, the plaintiff Zhang San says: "In April of the year before last I lent Li Si 5,000 yuan, and he said he would pay me back in two months. Two months later I went to ask for it, and he kept putting it off. Up to now I have gone to him four times, and each time he has made excuses not to pay. Actually, I know he has money; he even bought a car last month." The first text information recorded by the clerk may be: "Plaintiff Zhang San: 'Lent Li Si 5,000 yuan in April 2015, to be repaid in June 2015. After the due date Zhang San repeatedly pressed for repayment, and Li Si put it off without repaying. Zhang San believes Li Si is able to repay the debt.'" The client records the time at which the clerk starts inputting the first text information, for example 10:22:16.
In this example scenario, the audio collection terminal at Zhang San's seat can record Zhang San's speech to form audio information and feed it to the client. The client can record the time at which the audio information is received, 10:22:20, and can send the audio information, the reception time, and the terminal's tag name "Speaker 1" to the server.
In this example scenario, the server performs speech recognition and identity recognition on the received audio information and obtains the second text information expressed in it: "In April of the year before last I lent Li Si 5,000 yuan. He said he would pay me back in two months. Two months passed. I went to ask for it. He kept putting it off. Up to now I have gone to him four times. Each time he has made excuses not to pay. Actually. I know he has money. He even bought a car last month." The server can associate this second text information with the terminal tag name "Speaker 1", so that the second text information fed back to the client is the same content prefixed with "Speaker 1:". The server can feed the reception time back to the client together with the second text information.
In this example scenario, the client receives the second text information fed back by the server, compares the time of the audio corresponding to the second text information with the time of the first text information, and displays first and second text information that are close in time in correspondence with each other.
In this example scenario, the first text information and the second text information are each shown in a window; displaying them in correspondence may mean displaying corresponding first and second text information side by side horizontally.
In this example scenario, when a trigger event occurs on the first or second text information, for example when the clerk clicks it with the mouse, the client can play the corresponding audio information from which the second text information was generated.
Referring to FIG. 4: in this example scenario, the clerk can revise the first text information according to the second text information. The record initially input by the clerk may, because of time pressure, have missed some content, which can be corrected against the second text information. Compared with the second text information obtained by speech recognition, the clerk's first text information missed the number of times Zhang San pressed Li Si for repayment and the fact that Li Si had the financial means to buy a car. Checking against the second text information, the clerk can revise the first text information to: "Plaintiff Zhang San: 'Lent Li Si 5,000 yuan in April 2015, to be repaid in June 2015. After the due date Zhang San pressed for repayment four times, and Li Si put it off each time without repaying. Zhang San believes Li Si is able to repay the debt; Li Si also bought a car in June 2017.'"
In this example scenario, by displaying the first and second text information in correspondence, the client makes it convenient for the clerk to revise the first text information. The clerk can finally print the trial record formed from the first text information, making the record comprehensive and accurate while remaining concise and focused.
Referring also to FIG. 5: in another example scenario, in a meeting, the client used by the minute taker may be a notebook computer with an integrated microphone, through which the statements of the participants during the meeting can be recorded as audio information.
In this example scenario, the minute taker can input first text information using the notebook's keyboard. The first text information may be the content of participants' statements as heard by the minute taker, or a summary the minute taker makes of their statements or discussion. For example, in the discussion Wang Wu says: "Our project is on a tight schedule, so everyone needs to keep a close eye on progress. We have to complete project acceptance before the end of the year, so everyone must hurry. Can the R&D department deliver a complete design this month?" Ding Qi says: "Not by the end of this month. We only have a preliminary design at present; only after purchasing materials for preliminary verification can we deliver a relatively complete design. So when we can deliver it depends on when procurement can finish purchasing the raw materials." Qian Ba says: "We can finish purchasing the raw materials by the end of this month." Ding Qi says: "If the raw materials are in place by the end of this month, we should be able to deliver a relatively complete design by the middle of next month." Wang Wu says: "Fine, then it is settled: procurement finishes purchasing the raw materials by the end of this month, and R&D delivers a relatively complete design by the middle of next month."
Referring also to FIG. 6: in this example scenario, the first text information recorded by the minute taker may include: "Wang Wu said: 'The project is on a tight schedule and all departments need to speed up'"; "Qian Ba said: 'The raw materials needed by R&D will be purchased this month'"; "Ding Qi said: 'Provided procurement finishes purchasing the raw materials this month, a relatively complete design can be delivered next month'".
In this example scenario, the client records the participants' statements during the meeting as audio information and can perform speech recognition and identity recognition on it, obtaining the second text information and the identity information corresponding to it. At the start of the meeting, the participants can first register with the client: the client records meeting audio to generate a user feature vector identifying each participant and receives the entered identity information. During the meeting, the speech feature vector of the audio information can then be matched against the user feature vectors to determine the identity information corresponding to the second text information.
In this example scenario, the second text information recognized by the client from the audio information may include: "Wang Wu said: 'Our project is on a tight schedule. Everyone needs to keep a close eye on progress. We have to complete project acceptance before the end of the year. So everyone must hurry. Can the R&D department deliver a complete design this month.'" "Ding Qi said: 'Not by the end of this month. We only have a preliminary design at present. Only after purchasing materials for preliminary verification. Can we deliver a relatively complete design. So when we can deliver the complete design. Depends on when procurement can finish purchasing the raw materials.'" "Qian Ba said: 'We can finish purchasing the raw materials by the end of this month.'" "Ding Qi said: 'If the raw materials are in place by the end of this month. We should be able to deliver a relatively complete design by the middle of next month.'" "Wang Wu said: 'Fine. Then it is settled that procurement finishes purchasing the raw materials by the end of this month. R&D delivers a relatively complete design by the middle of next month.'"
Referring also to FIG. 7: in this example scenario, the client can determine the second text information corresponding to the first text information by semantic analysis. The minute taker can revise the first text information according to the content of the second text information: for example, changing "The project is on a tight schedule and all departments need to speed up" to "The project is on a tight schedule, project acceptance must be completed by the end of the year, and all departments need to speed up", and changing "Provided procurement finishes purchasing the raw materials this month, a relatively complete design can be delivered next month" to "Provided procurement finishes purchasing the raw materials this month, a relatively complete design can be delivered by the middle of next month".
In this example scenario, after finishing the revision of the first text information in the minutes, the minute taker sends the meeting minutes formed from the first text information to the participants by e-mail.
Referring to FIG. 8, an embodiment of the present specification provides an information processing method. The information processing method can be applied to a client and may include the following steps.
Step S10: Receive first text information input by a first input device; the first text information is generated according to speech.
In this embodiment, the first input device can be operated by the user so that the client receives the user's input. The first input device may be a peripheral connected to the client, such as a keyboard, mouse, or writing tablet; it may also be a touch display providing the user with a touch input function.
In this embodiment, the first text information may be text the user expresses of his or her own volition according to the speech heard. For example, during a multi-person meeting, the minute taker types text information with the keyboard according to the exchanges he or she hears.
In this embodiment, receiving the first text information may mean that the first text information is provided to the client by the user through the first input device, so that the first text information is generated according to the user's own volition.
In this embodiment, that the first text information is generated according to speech may mean it is text the user inputs to the client, under the user's own volition, according to the speech heard. The first text information may thus include the user's distillation of the content heard, content of the user's own understanding, or a direct account of what was heard.
In this embodiment, the first text information may be one sentence or one paragraph. In one multi-person conversation, several people may speak, and the final record formed by the record keeper may include multiple pieces of first text information. For example, the record keeper's input for one statement by one participant may form one piece of first text information; or the record keeper may distill the views expressed by several participants into one sentence as one piece of first text information; or the record keeper may divide the content of one participant's statement into several paragraphs, each constituting one piece of first text information. Of course, those skilled in the art may make other variations under the technical essence of this embodiment; as long as the functions and effects achieved are the same as or similar to those of the present application, they shall be covered by its scope of protection.
Step S12: Receive audio information recorded by a second input device; the audio information is generated by recording the speech.
In this embodiment, the second input device may be an audio collection terminal. It may be a microphone built into the client, or a microphone connected to the client as a peripheral. For example, a notebook computer has an integrated microphone, and a desktop computer can connect an external microphone through a USB or microphone jack.
In this embodiment, step S10 and step S12 may occur in any order. For example, in a court hearing, both the plaintiff and the defendant may have microphones; when the plaintiff or defendant speaks, the microphone can record the corresponding audio information. Meanwhile, the clerk inputs first text information to the client through the keyboard according to the statements heard.
Step S14: Perform speech recognition on the audio information to obtain second text information.
In this embodiment, the second text information may be content the client obtains by speech recognition, so that the record formed by the second text information tends to record the dialogue in the audio information comprehensively.
In this embodiment, the second text information may be one sentence or one paragraph. For example, one piece of second text information may be generated for one piece of audio information; or the recognized characters may form one sentence as one piece of second text information; or adjacent context expressing similar content may be taken as one piece of second text information; or, combined with identity recognition, the content of one statement by one user may be taken as one piece of second text information.
Step S16: Display the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
In this embodiment, the client can display the first and second text information on a display for easy review by the user. Because the second text information is generated by speech recognition of the audio information, it records the content of the audio relatively comprehensively, so the user can draw on the second text information to revise the content of the first text information. For example, during or after a court hearing, the clerk can use the second text information recognized from the plaintiff's or defendant's audio to revise the first text information the clerk recorded, making the first text information more accurate and bringing convenience to the clerk.
In this embodiment, the correspondence may include: first and second text information generated at about the same time are displayed at close positions; or they have the same display style; or they have close time labels; or, for audio information and first text information generated at about the same time, the second text information of that audio is displayed close to the first text information, has the same display style as the first text information, or has a time label close to that of the first text information; or first and second text information tending to express the same semantics are displayed at close positions or with the same display style.
In the embodiments of the present specification, for one and the same speech scene, the first and second text information can be obtained respectively by manual input and by speech recognition of recorded audio. Because the second text information records the speech content relatively comprehensively, it can be used to revise the first text information. The first text information can thus be comprehensive and accurate, highlight key points, and summarize the speech; compared with the second text information it can be shorter and more focused, saving the reader's time.
Referring to FIG. 9, in one embodiment, the speech recognition step may include the following sub-steps.
Step S20: Identify identity information corresponding to the audio information.
Step S22: Recognize second text information expressed by the speech in the audio information; the second text information is associated with the identity information.
In the displaying step, the identity information can be displayed in correspondence with the corresponding second text information.
In this embodiment, a feature matrix can be generated from the audio information; the feature matrix is then subjected to dimension reduction along multiple feature dimensions to obtain multiple dimension values characterizing the feature dimensions, and the multiple dimension values form the speech feature vector. The speech feature vector can be used to identify each user.
In this embodiment, the feature matrix can be reduced along different feature dimensions to obtain a dimension value that characterizes each feature dimension; arranging the dimension values in a specified order then forms the speech characterization vector of the audio information. Specifically, the dimension reduction can be performed on the feature matrix by convolution or mapping algorithms. In a specific example, a DNN (Deep Neural Network), CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), deep learning, or a combination of the above may be used to reduce the feature matrix along different dimensions.
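As a stand-in for the learned (DNN/CNN/RNN) reductions named above, the simplest illustration of collapsing a frames-by-coefficients feature matrix into a fixed-length speech feature vector is statistical pooling per feature dimension:

```python
import numpy as np

def feature_matrix_to_vector(feature_matrix):
    """Reduce a (frames x coefficients) feature matrix along each feature
    dimension to a dimension value, and concatenate the values in a fixed
    order to form the speech feature vector. Mean and standard deviation
    pooling stand in for the learned reductions; a real system would likely
    use a trained network here."""
    matrix = np.asarray(feature_matrix)
    return np.concatenate([matrix.mean(axis=0), matrix.std(axis=0)])
```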
In this embodiment, the collected audio information may be a recording of the user's speaking voice. A speech characterization vector generated from the audio information can thus characterize that audio information and also part of the user's vocal traits. Because each user's growth and development differ, each user's speaking voice has certain vocal traits of its own, and different users can be distinguished by them. The speech characterization vector can therefore be used to identify the user by characterizing part of the user's vocal traits.
In this embodiment, one or more pieces of audio information may be collected for a user, and a corresponding speech feature vector may be generated for each piece using the audio information processing method. Of course, in some cases more than one piece of audio information may also be processed together according to the audio information processing method to obtain a speech feature vector; in that case, the speech feature vector can correspond to the more than one pieces of audio information.
In this embodiment, a user feature vector that can identify the user can be determined from the obtained speech feature vectors. For example, if only one speech feature vector is generated, it can serve as the user's user feature vector. If multiple speech feature vectors are generated, the one that expresses relatively more of the user's vocal traits can be selected as the user feature vector; or some or all of the multiple speech feature vectors can be further processed to output the user feature vector. The processing may include, without limitation, summing the multiple speech feature vectors dimension-wise and then computing the mean. Of course, other algorithms are possible, such as a weighted summation of the multiple speech feature vectors.
In this embodiment, the speech feature vector is matched against the user feature vector; when the match succeeds, the personal information associated with the user feature vector is taken as the user's identity information. Specifically, the matching may be performed by computing on the two vectors and considering the match successful when a certain relationship between them holds. For example, the element-wise differences of the two may be summed and the resulting value used as a matching value compared against a set threshold; if the matching value is less than or equal to the threshold, the speech feature vector is considered to match the user feature vector. Alternatively, the speech feature vector and the user feature vector may be summed directly and the resulting value used as a matching value; if the matching value is greater than or equal to a set threshold, the speech feature vector is considered to match the user feature vector.
In this embodiment, in the step of identifying the user's identity information, the user's audio information may be collected first and the personal information entered afterwards. A speech feature vector can then be generated from the audio information and the entered personal information associated with it. The second text information obtained by speech recognition can thus be associated with the identity information of the user who produced the speech, and the identity information displayed in correspondence with the second text information. The association may be a corresponding storage relationship, or one of the two may be determined from the other. The personal information may be, for example, the user's name, nickname, or role, and the role may be plaintiff or defendant. Of course, it is also possible to enter no personal information and automatically assign the user a tag name associated with the speech feature vector, so that the second text information obtained by speech recognition is associated with the tag name of the user who produced the speech and the tag name displayed in correspondence with it; the tag name may be, for example, "user 1".
In this embodiment, the identity information is displayed in correspondence with the second text information, which makes it easy to read who is involved in the record and what each person said. The corresponding display may place the identity information at the beginning or the end of the second text information.
Referring to FIG. 10, in one embodiment, the information processing method may further include the following steps.
Step S21: Send the audio information to a server, for the server to determine the identity information corresponding to the audio information.
Step S23: Receive the identity information corresponding to the audio information fed back by the server.
Accordingly, in the displaying step, the identity information is displayed in correspondence with the corresponding second text information.
In this embodiment, identification of the identity information corresponding to the audio information can be completed by the server. The user can register with the server in advance, so that a user feature vector is stored under the identity information of the corresponding user in the server. The server can generate a speech feature vector from the received audio information, match it against the stored user feature vectors to find the user feature vector corresponding to the speech feature vector, and thereby obtain the identity information corresponding to the audio information. Of course, the server may also have no pre-registered user feature vectors; after receiving audio information, the server can generate a speech feature vector for each piece of audio information, cluster the multiple speech feature vectors so that the speech feature vectors of the voices belonging to one user are aggregated into one data set, and thereby obtain a user feature vector that can represent the user. For the identity information of each user, a tag name can be assigned to the user feature vector according to an established naming rule; each user feature vector can correspond to one tag name.
In one embodiment, the step of displaying the first text information and the second text information, or the step of typesetting them, may include: matching first text information and audio information whose reception times are close, so as to display the first text information in correspondence with the second text information recognized from the audio information.
In this embodiment, when the client receives the first text information input by the user, the reception time can be recorded. When the generated audio information is received, the reception time of the audio information can also be recorded; this reception time can also be understood as the generation time.
In this embodiment, that the first text information corresponds to the audio information may mean that the content expressed by the two tends to have the same semantics. Specifically, first text information and audio information received at close times are more likely to express the same semantics. The first and second text information are therefore displayed in correspondence, making it convenient to compare and revise either. For example, while several users communicate in a meeting, the record keeper may hear a user speak and then record the content, generating first text information, while the microphone records that user's speech as audio information. The time at which the record keeper inputs the first text information and the time at which the microphone records the speech can thus be fairly close, both being directed at one user's remarks, so a correspondence exists between the first text information and the second text information.
In this embodiment, the manner of displaying the first and second text information in correspondence may include: displaying them at close positions in one window, for example as two adjacent passages of text; displaying them side by side horizontally in two windows; or giving them the same display style, such as font, size, color, or background color.
In one embodiment, the step of displaying the first and second text information, or the step of typesetting them, may include: performing semantic matching of the first text information within the second text information of audio information generated within a specified time range, to obtain the second text information corresponding to the first text information.
In this embodiment, the first text information is input by the user according to the speech heard, and the second text information is obtained by speech recognition of the audio information. Both are therefore generated from the users' conversational speech in the same scene, so the first and second text information have a certain semantic consistency. Semantic analysis is used to match the first text information within the second text information and find the second text information closest in semantics to the first text information, which is then taken as the second text information corresponding to the first text information.
In this embodiment, the semantic matching may include: segmenting the first text information into the words it contains and comparing those words with the second text information. The second text information containing the most words of the first text information is taken as the corresponding second text information; or the second text information containing the most synonyms of the words of the first text information; or the second text information containing the most words of the first text information together with synonyms of those words.
In this embodiment, to reduce the computational load, a time range can be specified, and the first text information matched within the second text information of audio information generated inside that range. The time range may be close to the time at which the first text information was generated and may or may not include that time. In one embodiment, the semantic matching step further includes: taking the time at which the first text information is received as a reference time and setting the specified time range according to the reference time, with the reference time lying within the specified time range. The specified time range then includes the reception time of the first text information, so that the time of the audio information is close to the time at which the first text information was generated, which speeds up matching of the corresponding second text information, saving time and reducing computation.
Referring to FIG. 11, in one embodiment, the information processing method may further include the following steps.
Step S26: Modify the first text information according to input from the first input device.
In this embodiment, after the first and second text information are displayed, the user can revise the first text information according to the content of the second text information. The first text information thus retains the concise, focused character of a manual record while gaining accuracy and comprehensiveness; this also brings convenience to the person producing the record.
This embodiment may further include step S28: outputting the modified first text information.
In this embodiment, outputting the modified first text information may mean sending it to a printing device or sending it by e-mail.
In one embodiment, the information processing method may further include: when a trigger event occurs on the first or second text information, playing the audio information corresponding to the text on which the trigger event occurred.
In this embodiment, the trigger event may be a click or swipe operation received by the client. For example, the user clicks the first or second text information with the mouse, moves the mouse across it, or performs a touch tap on a touch display device.
In this embodiment, a trigger event on the first or second text information may be a trigger event in the area displaying that text information. Alternatively, a play button may be provided for the first or second text information; when a click occurs on the play button, a trigger event is considered to have occurred on the corresponding text information.
In this embodiment, for the manner in which first text information corresponds to audio information, refer to the descriptions of the other embodiments, which are not repeated here. The audio information corresponding to the second text information may be the audio information from which the second text information was recognized.
In one embodiment, the information processing method may further include: when a trigger event occurs on the first text information, displaying the second text information corresponding to it in a specified style; or, when a trigger event occurs on the second text information, displaying the first text information corresponding to it in a specified style.
In this embodiment, the specified style may include, without limitation, font, size, color, background color, bold, italics, underline, and the like.
In this embodiment, when a trigger event occurs on the first text information, displaying the corresponding second text information in the specified style indicates the correspondence between the two quite intuitively and makes checking convenient. The specified style can differ from the style of the other text, so that the display state of the corresponding text information is distinguished from other characters, which is convenient for the user. Likewise, when a trigger event occurs on the second text information, the first text information is displayed in the specified style.
Referring to FIG. 12, an embodiment of the present specification further provides an information processing system, which may include the following modules.
A first receiving module configured to receive first text information input by a first input device, the first text information being generated according to speech.
A second receiving module configured to receive audio information recorded by a second input device, the audio information being generated by recording the speech.
A recognition module configured to perform speech recognition on the audio information to obtain second text information.
A display module configured to display the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
In this embodiment, the functions and effects achieved by the electronic device can be explained with reference to the other embodiments and are not repeated here.
Referring to FIG. 13, an embodiment of the present specification further provides an information processing system, which may include an input device, an audio collection terminal, a display, and a processor.
The input device is configured to receive first text information input by a user; the first text information is input by the user according to speech.
The audio collection terminal is configured to record audio information, the audio information being generated by recording the speech.
The processor is configured to perform speech recognition on the audio information to obtain second text information.
The display is configured to display the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
In this embodiment, the input device may be a device with an information input function, for example a keyboard, a writing tablet, or a stylus; it is not limited to this list.
In this embodiment, the audio collection terminal includes a microphone. The microphone may be an energy conversion device that converts a sound signal into an electrical signal; its specific design may include, without limitation, electrodynamic, capacitive, piezoelectric, electromagnetic, carbon, semiconductor, and other types.
In this embodiment, the processor can be implemented in any suitable manner. For example, the processor can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so on.
In this embodiment, the display can provide an interface display. By manufacturing material, displays can be divided into cathode ray tube displays, plasma display panels, liquid crystal displays, LED (light-emitting diode) panels, and so on. The display is not limited to a flat display; it may also be a curved display, a stereoscopic display, or the like.
The information processing system provided in this embodiment can be explained with reference to the other embodiments.
An embodiment of the present specification further provides a computer storage medium. The computer storage medium stores computer program instructions that, when executed, can implement: receiving first text information input by a first input device, the first text information being generated according to speech; receiving audio information recorded by a second input device, the audio information being generated by recording the speech; performing speech recognition on the audio information to obtain second text information; and displaying the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
In this embodiment, the computer storage medium includes, without limitation, random access memory (RAM), read-only memory (ROM), a cache, a hard disk drive (HDD), or a memory card.
For the functions and effects achieved when the computer program instructions of the storage medium provided in this embodiment are executed, refer to the other embodiments.
Referring to FIG. 14, an embodiment of the present specification further provides an information processing method, which may include the following steps.
Step S30: Receive first text information input by a first input device; the first text information is generated according to speech.
Step S32: Receive audio information recorded by a second input device; the audio information is generated by recording the speech.
Step S34: Send the audio information, or characterization information of the audio information, to a server for the server to perform speech recognition.
Step S36: Receive second text information obtained by speech recognition and fed back by the server.
Step S38: Display the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
In this embodiment, the client can send the audio information to the server, and the server can perform speech recognition on it and send the resulting second text information to the client, so that the client can display the first and second text information. For the server's processing of the audio information, refer to the other embodiments; details are not repeated here.
In this embodiment, the client and the server can transmit data based on a network communication protocol, including but not limited to HTTP, TCP/IP, FTP, and the like.
In this embodiment, the characterization information may be computed from the audio information and can be used to represent the audio information. For example, it may be a feature matrix generated from the audio information, or the data after endpoint detection processing of the feature matrix.
In this embodiment, the server performs the speech recognition work, which reduces the computational load on the client and thereby lowers the hardware performance requirements for the client.
Referring to FIG. 15, an embodiment of the present specification further provides an information processing system, which may include the following modules.
A first receiving module configured to receive first text information input by a first input device, the first text information being generated according to speech.
A second receiving module configured to receive audio information recorded by a second input device, the audio information being generated by recording the speech.
A sending module configured to send the audio information, or characterization information of the audio information, to a server for the server to perform speech recognition.
A third receiving module configured to receive second text information obtained by speech recognition and fed back by the server.
A display module configured to display the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
In this embodiment, the functions and effects achieved by the electronic device can be explained with reference to the other embodiments and are not repeated here.
Referring to FIG. 16, an embodiment of the present specification further provides an information processing system, which may include an input device, a sound collection terminal, a network communication unit, and a display.
The input device is configured to receive first text information input by a user; the first text information is input by the user according to speech.
The sound collection terminal is configured to record audio information, the audio information being generated by recording the speech.
The network communication unit is configured to send the audio information, or characterization information of the audio information, to a server for the server to perform speech recognition, and to receive second text information obtained by speech recognition and fed back by the server.
The display is configured to display the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
In this embodiment, the network communication unit may be an interface, set up according to the standard specified by a communication protocol, for network connection and communication; it may communicate in a wired and/or wireless manner.
The information processing system provided in this embodiment can be explained with reference to the other embodiments.
An embodiment of the present specification further provides a computer storage medium storing computer program instructions that, when executed, can implement: receiving first text information input by a first input device, the first text information being generated according to speech; receiving audio information recorded by a second input device, the audio information being generated by recording the speech; sending the audio information, or characterization information of the audio information, to a server for the server to perform speech recognition; receiving second text information obtained by speech recognition and fed back by the server; and displaying the first text information and the second text information, where a correspondence exists between the content of the first text information and the second text information.
In this embodiment, the computer storage medium includes, without limitation, random access memory (RAM), read-only memory (ROM), a cache, a hard disk drive (HDD), or a memory card.
For the functions and effects achieved when the computer program instructions of the storage medium provided in this embodiment are executed, refer to the other embodiments.
Referring to FIG. 17, an embodiment of the present specification further provides an information processing method, which may include the following steps.
Step S40: Receive audio information, or characterization information of audio information, sent by a client.
Step S42: Perform speech recognition on the audio information or the characterization information to obtain second text information.
Step S44: Send the second text information to the client, so that the client displays the second text information together with the first text information of the first client; a correspondence exists between the content of the first text information and the second text information.
In this embodiment, the server receives the audio information or characterization information sent by the client over the network. An algorithm model for performing speech recognition on audio information or characterization information may be provided in the server. For the processing of the audio information or characterization information, refer to the descriptions in the foregoing embodiments; details are not repeated here.
In this embodiment, one server may serve multiple clients, so the clients need not perform speech recognition work; the server performs it uniformly, which lowers the hardware performance requirements for the clients. Furthermore, having the server perform speech recognition uniformly facilitates maintaining, updating, and upgrading the speech recognition algorithms.
The content described in this embodiment can be explained with reference to the other embodiments and is not repeated here.
Referring to FIG. 9, in one embodiment, the server's speech recognition step may include the following steps.
Step S20: Identify identity information corresponding to the audio information.
Step S22: Recognize second text information expressed by the speech in the audio information; the second text information is associated with the identity information.
In this embodiment, the server can identify the identity information corresponding to the audio information, so that the server can associate the second text information belonging to one user with that user's identity information. After the server places the identity information into the second text information and provides it to the client, the content displayed by the client is easy for users to read.
The functions and effects specifically achieved by this embodiment can be explained with reference to the other embodiments and are not repeated here.
Referring to FIG. 18, an embodiment of the present specification further provides a server, which may include the following modules.
A receiving module configured to receive audio information, or characterization information of audio information, sent by a client.
A recognition module configured to perform speech recognition on the audio information or the characterization information to obtain second text information.
A sending module configured to send the second text information to the client, so that the client displays the second text information together with the first text information of the first client; a correspondence exists between the content of the first text information and the second text information.
In this embodiment, the functions and effects achieved by the server can be explained with reference to the other embodiments and are not repeated here.
Referring to FIG. 19, an embodiment of the present specification further provides an electronic device, which may include a network communication unit and a processor.
The network communication unit is configured to receive audio information, or characterization information of audio information, sent by a client, and to send the second text information provided by the processor to the client, so that the client displays the second text information together with the first text information of the first client; a correspondence exists between the content of the first text information and the second text information.
The processor is configured to perform speech recognition on the audio information or the characterization information to obtain the second text information.
The functions and effects achieved by the electronic device provided in this embodiment can be explained with reference to the other embodiments.
An embodiment of the present specification further provides a computer storage medium storing computer program instructions that, when executed, can implement: receiving audio information, or characterization information of audio information, sent by a client; performing speech recognition on the audio information or the characterization information to obtain second text information; and sending the second text information to the client, so that the client displays the second text information together with the first text information of the first client, where a correspondence exists between the content of the first text information and the second text information.
For the functions and effects achieved when the computer program instructions of the storage medium provided in this embodiment are executed, refer to the other embodiments.
Referring to FIG. 21, an embodiment of the present specification further provides an information processing method, which may include the following steps.
Step S50: Receive first text information; the first text information is generated according to speech.
Step S52: Receive audio information; the audio information is generated by recording the speech.
Step S54: Perform speech recognition on the audio information to obtain second text information.
Step S56: Typeset according to the correspondence between the first text information and the second text information; the typeset first text information and second text information are used for display.
In this embodiment, the electronic device executing this information processing method can perform speech recognition on the audio information to obtain the second text information, and further typeset the first and second text information according to the correspondence between them.
In this embodiment, for the specific manner of typesetting the first and second text information, refer to the description of the typesetting subsystem above or to the other embodiments.
An embodiment of the present specification further provides an electronic device including a processor configured to: receive first text information, the first text information being generated according to speech; receive audio information, the audio information being generated by recording the speech; perform speech recognition on the audio information to obtain second text information; and typeset according to the correspondence between the first text information and the second text information, the typeset first and second text information being used for display.
The functions and effects achieved by the electronic device provided in this embodiment can be explained with reference to the other embodiments.
An embodiment of the present specification further provides a computer storage medium storing computer program instructions that, when executed, implement: receiving first text information, the first text information being generated according to speech; receiving audio information, the audio information being generated by recording the speech; performing speech recognition on the audio information to obtain second text information; and typesetting according to the correspondence between the first text information and the second text information, the typeset first and second text information being used for display.
For the functions and effects achieved when the computer program instructions of the storage medium provided in this embodiment are executed, refer to the other embodiments.
Referring to FIG. 22 and FIG. 3 together, an embodiment of the present specification further provides an information processing method, which may include the following steps.
Step S60: Receive first text information input by a first input device; the first text information is generated according to speech.
Step S62: Receive audio information recorded by a second input device; the audio information is generated by recording the speech.
Step S64: Recognize the audio information to obtain second text information.
Step S66: Display the first text information in a first area.
Step S68: Display the second text information in a second area, where a correspondence exists between the content of the first text information and the second text information, and the first area and the second area are located in the same interface.
In this embodiment, an interface presented on the display can be divided into at least a first area and a second area, the first area being used to display the first text information and the second area the second text information. Dividing at least a first and a second area allows the first and second text information to be clearly distinguished, which is convenient for the reader.
In this embodiment, the first and second text information are generated in different ways. The first text information may be content input by the record keeper according to his or her own understanding of the speech, so it can be more summarized, relatively concise, and relatively focused. The second text information may be content obtained by speech recognition of the audio information, so it can be very comprehensive; compared with the first text information, however, it may be relatively long and less focused. Because a correspondence exists between the content of the first and second text information, a reader can check the two against each other, which brings convenience to the reader.
The embodiments in the present specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others.
In the 1990s, an improvement to a technology could clearly be distinguished as a hardware improvement (for example, an improvement to circuit structures such as diodes, transistors, and switches) or a software improvement (an improvement to a method flow). With the development of technology, however, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized with a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. Designers program by themselves to "integrate" a digital system onto a piece of PLD, without needing to ask a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is now mostly realized with "logic compiler" software, which is similar to the software compilers used in program development; the original code to be compiled must be written in a particular programming language called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art will also understand that a hardware circuit implementing a logical method flow can easily be obtained merely by slightly logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
Those skilled in the art also know that, in addition to implementing a controller purely as computer-readable program code, it is entirely possible, by logically programming the method steps, to make the controller implement the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for implementing various functions can also be regarded as structures within the hardware component. Or the means for implementing various functions can even be regarded as both software modules implementing the method and structures within the hardware component.
From the description of the above embodiments, those skilled in the art can clearly understand that the present specification can be implemented by software plus a necessary general hardware platform. Based on such an understanding, the technical solution of the present specification, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present specification or in certain parts of the embodiments.
Although the present specification has been depicted through embodiments, those of ordinary skill in the art know that the present specification has many variations and changes that do not depart from its spirit, and it is hoped that the appended claims include those variations and changes without departing from the spirit of the present specification.

Claims (32)

  1. An information processing method, comprising:
    receiving first text information input by a first input device, the first text information being generated according to speech;
    receiving audio information recorded by a second input device, the audio information being generated by recording the speech;
    performing speech recognition on the audio information to obtain second text information; and
    displaying the first text information and the second text information, wherein a correspondence exists between the content of the first text information and the second text information.
  2. The method according to claim 1, wherein the speech recognition step comprises:
    identifying identity information corresponding to the audio information; and
    recognizing second text information expressed by the speech in the audio information, the second text information being associated with the identity information;
    correspondingly, in the displaying step, the identity information is displayed in correspondence with the corresponding second text information.
  3. The method according to claim 1, further comprising:
    sending the audio information to a server, for the server to determine identity information corresponding to the audio information; and
    receiving the identity information corresponding to the audio information fed back by the server;
    correspondingly, in the displaying step, the identity information is displayed in correspondence with the corresponding second text information.
  4. The method according to claim 1, wherein the step of displaying the first text information and the second text information comprises: matching first text information and audio information whose reception times are close, so as to display the first text information in correspondence with the second text information recognized from the audio information.
  5. The method according to claim 1, wherein the step of displaying the first text information and the second text information comprises:
    performing semantic matching of the first text information within the second text information of audio information generated within a specified time range, to obtain the second text information corresponding to the first text information.
  6. The method according to claim 5, wherein the semantic matching step further comprises: taking the time at which the first text information is received as a reference time, and setting the specified time range according to the reference time, the reference time lying within the specified time range.
  7. The method according to claim 1, further comprising:
    modifying the first text information according to input from the first input device; and
    outputting the modified first text information.
  8. The method according to claim 1, further comprising:
    when a trigger event occurs on the first text information or the second text information, playing the audio information corresponding to the first or second text information on which the trigger event occurred.
  9. The method according to claim 1, further comprising:
    when a trigger event occurs on the first text information, displaying the second text information corresponding to the first text information in a specified style; or
    when a trigger event occurs on the second text information, displaying the first text information corresponding to the second text information in a specified style.
  10. An information processing system, comprising an input device, an audio collection terminal, a display, and a processor, wherein:
    the input device is configured to receive first text information input by a user, the first text information being input by the user according to speech;
    the audio collection terminal is configured to record audio information, the audio information being generated by recording the speech;
    the processor is configured to perform speech recognition on the audio information to obtain second text information; and
    the display is configured to display the first text information and the second text information, wherein a correspondence exists between the content of the first text information and the second text information.
  11. A computer storage medium storing computer program instructions that, when executed, implement: receiving first text information input by a first input device, the first text information being generated according to speech; receiving audio information recorded by a second input device, the audio information being generated by recording the speech; performing speech recognition on the audio information to obtain second text information; and displaying the first text information and the second text information, wherein a correspondence exists between the content of the first text information and the second text information.
  12. An information processing method, characterized in that the method comprises:
    receiving first text information input by a first input device, wherein the first text information is generated according to speech;
    receiving audio information recorded by a second input device, wherein the audio information is generated by recording the speech;
    sending the audio information, or representative information of the audio information, to a server, for the server to perform speech recognition;
    receiving second text information, obtained by the speech recognition, fed back by the server; and
    displaying the first text information and the second text information, wherein content of the first text information and content of the second text information correspond to each other.
  13. The method according to claim 12, characterized in that the step of displaying the first text information and the second text information comprises: associating first text information and audio information whose reception times are close, so as to correspondingly display the first text information and the second text information recognized from the audio information.
  14. The method according to claim 12, characterized in that the step of displaying the first text information and the second text information comprises:
    performing semantic matching of the first text information against second text information of audio information generated within a specified time range, to obtain the second text information corresponding to the first text information.
  15. The method according to claim 14, characterized in that the semantic matching step further comprises: taking the time at which the first text information is received as a reference time, and setting the specified time range according to the reference time, wherein the reference time lies within the specified time range.
  16. The method according to claim 12, characterized in that the method further comprises:
    modifying the first text information according to input of the first input device; and
    outputting the modified first text information.
  17. The method according to claim 12, characterized in that the method further comprises:
    when a trigger event occurs on the first text information or the second text information, playing the audio information corresponding to the first text information or the second text information on which the trigger event occurred.
  18. The method according to claim 12, characterized in that the method further comprises:
    when a trigger event occurs on the first text information, displaying the second text information corresponding to the first text information in a specified style; or,
    when a trigger event occurs on the second text information, displaying the first text information corresponding to the second text information in a specified style.
  19. An information processing system, characterized in that the information processing system comprises: an input device, an audio collection terminal, a network communication unit, and a display;
    the input device is configured to receive first text information input by a user, wherein the first text information is input by the user according to speech;
    the audio collection terminal is configured to record audio information, wherein the audio information is generated by recording the speech;
    the network communication unit is configured to send the audio information, or representative information of the audio information, to a server, for the server to perform speech recognition, and to receive second text information, obtained by the speech recognition, fed back by the server; and
    the display is configured to display the first text information and the second text information, wherein content of the first text information and content of the second text information correspond to each other.
  20. A computer storage medium, characterized in that the computer storage medium stores computer program instructions that, when executed, implement: receiving first text information input by a first input device, wherein the first text information is generated according to speech; receiving audio information recorded by a second input device, wherein the audio information is generated by recording the speech; sending the audio information, or representative information of the audio information, to a server, for the server to perform speech recognition; receiving second text information, obtained by the speech recognition, fed back by the server; and displaying the first text information and the second text information, wherein content of the first text information and content of the second text information correspond to each other.
  21. An information processing method, characterized in that the method comprises:
    receiving audio information, or representative information of audio information, sent by a client;
    performing speech recognition on the audio information or the representative information to obtain second text information; and
    sending the second text information to the client, for the client to display the second text information together with first text information of the first client, wherein content of the first text information and content of the second text information correspond to each other.
  22. The method according to claim 21, characterized in that the speech recognition step comprises:
    recognizing identity information corresponding to the audio information; and
    recognizing second text information expressed by the speech in the audio information, wherein the second text information is associated with the identity information.
  23. An electronic device, characterized in that the electronic device comprises: a network communication unit and a processor;
    the network communication unit is configured to receive audio information, or representative information of audio information, sent by a client, and to send second text information provided by the processor to the client, for the client to display the second text information together with first text information of the first client, wherein content of the first text information and content of the second text information correspond to each other; and
    the processor is configured to perform speech recognition on the audio information or the representative information to obtain the second text information.
  24. A computer storage medium, characterized in that the computer storage medium stores computer program instructions that, when executed, implement: receiving audio information, or representative information of audio information, sent by a client; performing speech recognition on the audio information or the representative information to obtain second text information; and sending the second text information to the client, for the client to display the second text information together with first text information of the first client, wherein content of the first text information and content of the second text information correspond to each other.
  25. An information processing method, characterized by comprising:
    receiving first text information, wherein the first text information is generated according to speech;
    receiving audio information, wherein the audio information is generated by recording the speech;
    performing speech recognition on the audio information to obtain second text information; and
    performing layout according to the correspondence between the first text information and the second text information, wherein the laid-out first text information and second text information are used for display.
  26. The method according to claim 25, characterized in that the step of laying out the first text information and the second text information comprises: associating first text information and audio information whose reception times are close, so as to correspondingly display the first text information and the second text information recognized from the audio information.
  27. The method according to claim 25, characterized in that the step of laying out the first text information and the second text information comprises:
    performing semantic matching of the first text information against second text information of audio information generated within a specified time range, to obtain the second text information corresponding to the first text information.
  28. The method according to claim 27, characterized in that the semantic matching step further comprises: taking the time at which the first text information is received as a reference time, and setting the specified time range according to the reference time, wherein the reference time lies within the specified time range.
  29. The method according to claim 25, characterized in that the method further comprises:
    receiving input for modifying the first text information, and modifying the first text information.
  30. An electronic device, characterized in that the electronic device comprises a processor,
    wherein the processor is configured to: receive first text information, the first text information being generated according to speech; receive audio information, the audio information being generated by recording the speech; perform speech recognition on the audio information to obtain second text information; and perform layout according to the correspondence between the first text information and the second text information, wherein the laid-out first text information and second text information are used for display.
  31. A computer storage medium, characterized in that the computer storage medium stores computer program instructions that, when executed, implement: receiving first text information, wherein the first text information is generated according to speech; receiving audio information, wherein the audio information is generated by recording the speech; performing speech recognition on the audio information to obtain second text information; and performing layout according to the correspondence between the first text information and the second text information, wherein the laid-out first text information and second text information are used for display.
  32. An information processing method, characterized in that the method comprises:
    receiving first text information input by a first input device, wherein the first text information is generated according to speech;
    receiving audio information recorded by a second input device, wherein the audio information is generated by recording the speech;
    performing recognition on the audio information to obtain second text information;
    displaying the first text information in a first area; and
    displaying the second text information in a second area, wherein content of the first text information and content of the second text information correspond to each other, and the first area and the second area are located in the same interface.
PCT/CN2018/095081 2017-07-19 2018-07-10 Information processing method, system, electronic device, and computer storage medium WO2019015505A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/742,753 US11664030B2 (en) 2017-07-19 2020-01-14 Information processing method, system, electronic device, and computer storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710592686.1A 2017-07-19 2017-07-19 Information processing method, system, electronic device, and computer storage medium
CN201710592686.1 2017-07-19

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/742,753 Continuation US11664030B2 (en) 2017-07-19 2020-01-14 Information processing method, system, electronic device, and computer storage medium

Publications (1)

Publication Number Publication Date
WO2019015505A1 (zh)

Family

ID=65015969

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/095081 WO2019015505A1 (zh) Information processing method, system, electronic device, and computer storage medium 2017-07-19 2018-07-10

Country Status (3)

Country Link
US (1) US11664030B2 (zh)
CN (1) CN109285548A (zh)
WO (1) WO2019015505A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020153934A1 (en) * 2019-01-21 2020-07-30 Hewlett-Packard Development Company, L.P. Fault prediction model training with audio data
CN111177353B * 2019-12-27 2023-06-09 赣州得辉达科技有限公司 Text record generation method and apparatus, computer device, and storage medium
CN111583907B * 2020-04-15 2023-08-15 北京小米松果电子有限公司 Information processing method and apparatus, and storage medium
CN111489522A * 2020-05-29 2020-08-04 北京百度网讯科技有限公司 Method, apparatus, and system for outputting information
CN113111658B * 2021-04-08 2023-08-18 百度在线网络技术(北京)有限公司 Method, apparatus, device, and storage medium for verifying information
WO2024101641A1 * 2022-11-08 2024-05-16 삼성전자 주식회사 Electronic device and method for compensating for missing handwriting

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU3588200A (en) * 1999-02-05 2000-08-25 Custom Speech Usa, Inc. System and method for automating transcription services
US6477491B1 (en) * 1999-05-27 2002-11-05 Mark Chandler System and method for providing speaker-specific records of statements of speakers
US6671669B1 * 2000-07-18 2003-12-30 Qualcomm Incorporated Combined engine system and method for voice recognition
WO2002037223A2 (en) * 2000-11-06 2002-05-10 Invention Machine Corporation Computer based integrated text and graphic document analysis
DE10204924A1 * 2002-02-07 2003-08-21 Philips Intellectual Property Method and device for rapid, pattern-recognition-supported transcription of spoken and written utterances
US7386454B2 (en) * 2002-07-31 2008-06-10 International Business Machines Corporation Natural error handling in speech recognition
CN100440353C * 2004-06-15 2008-12-03 梁国雄 Computer audio-recording information system for court trials
US7676705B2 (en) * 2005-12-30 2010-03-09 Sap Ag User interface messaging system and method permitting deferral of message resolution
US8019608B2 (en) * 2008-08-29 2011-09-13 Multimodal Technologies, Inc. Distributed speech recognition using one way communication
CN105810207A * 2014-12-30 2016-07-27 富泰华工业(深圳)有限公司 Meeting recording device and method thereof for automatically generating meeting minutes
CN105427857B * 2015-10-30 2019-11-08 华勤通讯技术有限公司 Method and system for generating text transcripts
CN106372122B * 2016-08-23 2018-04-10 温州大学瓯江学院 Document classification method and system based on Wiki semantic matching
CN106782545B * 2016-12-16 2019-07-16 广州视源电子科技股份有限公司 System and method for converting audio and video data into text transcripts

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060167686A1 (en) * 2003-02-19 2006-07-27 Jonathan Kahn Method for form completion using speech recognition and text comparison
CN101763382A * 2008-12-25 2010-06-30 新奥特硅谷视频技术有限责任公司 Information processing method and apparatus based on role and priority settings
CN101833985A * 2009-03-12 2010-09-15 新奥特硅谷视频技术有限责任公司 Speech-recognition-based real-time indexing system for court trial video recordings
CN102906735A * 2010-05-21 2013-01-30 微软公司 Voice-stream-augmented note taking
CN106782551A * 2016-12-06 2017-05-31 北京华夏电通科技有限公司 Speech recognition system and method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015872A * 2019-05-29 2020-12-01 华为技术有限公司 Question recognition method and apparatus
CN113761968A * 2020-06-01 2021-12-07 阿里巴巴集团控股有限公司 Data processing method and apparatus, electronic device, and computer storage medium
CN112397090A * 2020-11-09 2021-02-23 电子科技大学 FPGA-based real-time sound classification method and system
CN112397090B * 2020-11-09 2022-11-15 电子科技大学 FPGA-based real-time sound classification method and system

Also Published As

Publication number Publication date
US20200152200A1 (en) 2020-05-14
US11664030B2 (en) 2023-05-30
CN109285548A (zh) 2019-01-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18835540

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18835540

Country of ref document: EP

Kind code of ref document: A1