WO2023185641A1 - Data processing method and electronic device - Google Patents

Data processing method and electronic device

Info

Publication number
WO2023185641A1
Application number
PCT/CN2023/083455 (CN2023083455W)
Authority
WO
WIPO (PCT)
Prior art keywords
text
audio
segment
sentence
timestamp
Other languages
English (en)
French (fr)
Inventors
丁小龙
徐亮
卞苏成
李英浩
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023185641A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/3331: Query processing
    • G06F 16/334: Query execution
    • G06F 16/3343: Query execution using phonetics
    • G06F 16/338: Presentation of query results
    • G06F 16/38: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048: Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0481: Interaction techniques based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F 3/04817: Interaction techniques using icons
    • G06F 3/0484: Interaction techniques for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F 3/04842: Selection of displayed objects or displayed text elements
    • G06F 3/04847: Interaction techniques to control parameter settings, e.g. interaction with sliders or dials
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M 1/00: Substation equipment, e.g. for use by subscribers
    • H04M 1/72: Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724: User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72403: User interfaces with means for local support of applications that increase the functionality
    • H04M 1/7243: User interfaces with interactive means for internal management of messages
    • H04M 1/72433: User interfaces for voice messaging, e.g. dictaphones
    • H04M 1/72436: User interfaces for text messaging, e.g. short messaging services [SMS] or e-mails

Definitions

  • the embodiments of the present application relate to the technical field of terminal equipment, and in particular, to a data processing method and electronic equipment.
  • users often need to use electronic devices to record live speech and convert the recording into text. After recording is completed, the user may need to modify the converted text, and can operate the electronic device to play back the recording to guide those modifications.
  • this application provides a data processing method and an electronic device.
  • text conversion can be performed during the recording process, which improves text conversion efficiency, and original recording data containing timestamps and character counts can be obtained during the recording-to-text process for fixed-point audio playback, which improves the accuracy of fixed-point playback of both audio and text.
  • in a first aspect, embodiments of the present application provide a data processing method, which can be applied to electronic devices.
  • the method includes: in response to a received first user operation, obtaining first information in the process of converting audio data into text data; wherein the audio data is audio data collected in real time, and the first information includes a first mapping relationship between a first timestamp of a first audio segment and a first number of characters of a first text segment, where the first text segment is the first text conversion result of the first audio segment, the audio data includes at least one first audio segment, the text data includes at least one first text segment, and the first timestamp is a timestamp used to identify the starting time point or ending time point of the first audio segment.
  • in response to a received second user operation, based on the first information, updating the playback progress of the audio data to the first starting time point of a second audio segment, and displaying a second text segment in a preset display mode; wherein the second text segment is the second text conversion result of the second audio segment, the second audio segment includes at least one of the first audio segments, and the second text segment includes at least one of the first text segments.
  • the text segments involved in the first mapping relationships in the first information are the final text conversion results (referred to as final results) of the corresponding audio segments.
  • the first text conversion result refers to the final text conversion result of the first audio segment; similarly, the second text conversion result refers to the final text conversion result of the second audio segment.
  • the second user operation may be an operation of adjusting the playback progress of the recorded audio data to achieve reverse fixed-point playback, or an operation on the text data converted from the audio data to achieve forward fixed-point playback.
  • the second user operation may also take the form of voice input to implement the above-mentioned forward fixed-point playback or reverse fixed-point playback; this application does not limit this.
  • this embodiment can convert audio into text while the audio is being collected in real time, and obtain the first information (also called original recording data) during the audio-to-text process, where the original recording data may include a mapping relationship between the timestamp of an audio segment and the number of characters of the corresponding text segment, the text segment being the final result of the electronic device's text conversion of that audio segment.
  • the timestamps and character counts in the original recording data are relatively accurate, so the second audio segment that needs to be played can be located more accurately, and the final text conversion result of the second audio segment can likewise be located relatively accurately, enabling accurate fixed-point location of both audio and text.
  • recording-to-text can be completed on the electronic device side, which improves the efficiency of recording-to-text.
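  • As an illustration only, not part of the claimed method, the original recording data described above can be modeled as an ordered list of timestamp/character-count pairs. The following Kotlin sketch uses hypothetical names (SentenceMapping, timestampMs, charCount) that do not come from the patent:

```kotlin
// Hypothetical model of the "original recording data": one entry per sentence,
// mapping the first timestamp of a first audio segment to the first number of
// characters of the corresponding first text segment.
data class SentenceMapping(
    val timestampMs: Long, // first timestamp: starting (or ending) time point of the audio segment
    val charCount: Int     // first number of characters of the corresponding text segment
)

// The original recording data is an ordered list, one entry per sentence.
typealias OriginalRecordingData = List<SentenceMapping>
```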
  • the last character in each first text segment is a preset punctuation mark, where a preset punctuation mark is a punctuation mark that semantically represents sentence segmentation.
  • preset punctuation marks may include, but are not limited to: commas, periods, exclamation marks, question marks, semicolons, enumeration commas, etc.
  • the final text data after converting the audio data into text may include at least one first text segment, the last character of each first text segment being a preset punctuation mark. This enables text conversion of the audio and makes it convenient to use the preset punctuation marks to determine the first timestamps of the corresponding at least one first audio segment, which improves the accuracy of the first timestamps in the original recording data.
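  • As a minimal sketch of the segmentation just described, assuming a particular set of preset punctuation marks (the set and the function name splitIntoSegments are illustrative assumptions, not taken from the patent):

```kotlin
// Split a final text conversion result into first text segments, each ending
// with a preset punctuation mark.
val presetPunctuation = setOf('，', '。', '！', '？', '；', '、', ',', '.', '!', '?', ';')

fun splitIntoSegments(finalResult: String): List<String> {
    val segments = mutableListOf<String>()
    val current = StringBuilder()
    for (ch in finalResult) {
        current.append(ch)
        if (ch in presetPunctuation) { // sentence boundary reached
            segments.add(current.toString())
            current.clear()
        }
    }
    if (current.isNotEmpty()) segments.add(current.toString()) // trailing text without punctuation
    return segments
}

// Example: splitIntoSegments("你好！我叫张三。") returns ["你好！", "我叫张三。"].
```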
  • the process of converting the audio data into the text data includes: converting the audio data into the text data; when a third text conversion result (an intermediate result) is detected, obtaining a second timestamp based on the preset punctuation marks in the third text conversion result, and, based on the second timestamp, recording or updating a first correspondence between arrangement orders and timestamps corresponding to the intermediate result, wherein the second timestamp is used to identify the generation time of the third text conversion result, and the arrangement order is used to represent the arrangement order of the preset punctuation marks in the intermediate result; and when a fourth text conversion result (a final result) is detected, obtaining the first mapping relationship based on a second correspondence and the fourth text conversion result, wherein the second correspondence is the first correspondence corresponding to the most recently detected intermediate result.
  • in the process of converting the audio data into the text data, the audio segment to be converted can first be converted into a temporary result (i.e., an intermediate result), and the temporary result can be updated iteratively; when conversion of the audio segment finishes, the most recently converted temporary result is taken as the final result (i.e., the final text conversion result).
  • the fourth text conversion result is the same as the most recently detected third text conversion result, that is, the text content of the final result is the same as the text content of the most recently detected temporary result.
  • the precision of the audio-to-text algorithm used by the electronic device can be a single character, that is, each time a character (a single Chinese character, a single word, a symbol, etc.) is added to or removed from the temporary result, an updated temporary result is output.
  • the generation time of a temporary result and its output time can be the same, so the second timestamp can be obtained based on the temporary result (for example, as the current audio collection duration, i.e., the current recording duration).
  • when the final result is detected, the electronic device can generate at least one first mapping relationship in the original recording data based on the correspondence between the arrangement orders and timestamps of the most recently detected temporary result.
  • original recording data for the recorded audio data and the text data converted from that audio data can thereby be generated. This embodiment can improve the accuracy of each first mapping relationship in the original recording data.
  • obtaining the second timestamp based on the preset punctuation marks in the third text conversion result includes: obtaining the second timestamp when it is detected that the third text conversion result includes a preset punctuation mark and the third text conversion result is the first intermediate result; or obtaining the second timestamp when it is detected that the first count of preset punctuation marks in the third text conversion result is greater than the second count of preset punctuation marks in the previous third text conversion result; or obtaining the second timestamp when it is detected that the first count is less than the second count.
  • in the process of converting the audio data into the text data, a group of at least one temporary result (i.e., intermediate result) and the final result corresponding to that group may be generated in sequence. After a final result is generated, the next group of at least one temporary result and its corresponding final result can be generated, so the final results obtained in sequence constitute the final text conversion result of the audio data. Then, each time a group of at least one temporary result is generated, if the first temporary result generated includes a preset punctuation mark, the current recording duration can be obtained as the second timestamp.
  • when the count of preset punctuation marks in a newly generated temporary result changes relative to the previous one, acquisition of the current recording duration can likewise be triggered to obtain the second timestamp.
  • after obtaining the second timestamp, the electronic device can use it to continue updating the first correspondence of the latest temporary result, improving the accuracy of the first timestamps in the original recording data.
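  • The trigger conditions above can be sketched as follows, reusing the presetPunctuation set from the earlier sketch; the class name and callback are illustrative assumptions:

```kotlin
// Capture a second timestamp (the current recording duration) when the first
// temporary result already contains a preset punctuation mark, or when the
// punctuation count of the newest temporary result differs from the previous one's.
class PunctuationTracker(private val recordingDurationMs: () -> Long) {
    private var lastCount: Int = -1 // -1 until the first intermediate result arrives

    /** Returns a second timestamp if one should be captured, or null otherwise. */
    fun onIntermediateResult(text: String): Long? {
        val count = text.count { it in presetPunctuation }
        val isFirst = lastCount == -1
        val changed = !isFirst && count != lastCount
        lastCount = count
        return if ((isFirst && count > 0) || changed) recordingDurationMs() else null
    }
}
```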
  • recording or updating, based on the second timestamp, the first correspondence between arrangement orders and timestamps corresponding to the intermediate result includes: in the first correspondence corresponding to the intermediate result, recording or adding a correspondence between the last arrangement order and the second timestamp; or, in the first correspondence corresponding to the intermediate result, deleting the correspondence between the last arrangement order and its timestamp to update the first correspondence, and updating the timestamp corresponding to the now-last arrangement order in the updated first correspondence to the second timestamp.
  • in one case, a record may be added. For example, the first correspondence includes the correspondence between arrangement order 0 and timestamp 0 and the correspondence between arrangement order 1 and timestamp 1; when adding to the first correspondence the correspondence between the last arrangement order and the second timestamp (for example, timestamp 2), the correspondence between arrangement order 2 and timestamp 2 can be added. Arrangement order 2 here is an example of the last arrangement order mentioned above.
  • in another case, the first correspondence includes the correspondence between arrangement order 0 and timestamp 0 and the correspondence between arrangement order 1 and timestamp 1. The last correspondence between arrangement order and timestamp can be deleted, that is, the correspondence between arrangement order 1 and timestamp 1 is deleted, so the updated first correspondence includes only the correspondence between arrangement order 0 and timestamp 0. Then the timestamp corresponding to the current last arrangement order in the updated first correspondence (here, arrangement order 0, whose timestamp was timestamp 0) can be updated to the second timestamp (here, timestamp 2), so that arrangement order 0 in the updated first correspondence corresponds to timestamp 2.
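  • The add and delete-and-update behavior in the two cases above can be sketched like this (names are illustrative assumptions; the index of each stored timestamp is its arrangement order):

```kotlin
// Maintain the first correspondence (arrangement order -> timestamp) for the
// latest temporary result.
class FirstCorrespondence {
    private val timestamps = mutableListOf<Long>()

    // Punctuation count grew: append a new last entry, e.g. add
    // (order 2 -> timestamp 2) after (order 0, order 1).
    fun onCountIncreased(secondTimestamp: Long) {
        timestamps.add(secondTimestamp)
    }

    // Punctuation count shrank: drop the last entry, then overwrite the new
    // last entry's timestamp, e.g. delete (order 1 -> timestamp 1) and set
    // order 0 -> timestamp 2.
    fun onCountDecreased(secondTimestamp: Long) {
        if (timestamps.isNotEmpty()) timestamps.removeAt(timestamps.lastIndex)
        if (timestamps.isNotEmpty()) timestamps[timestamps.lastIndex] = secondTimestamp
    }

    fun snapshot(): List<Long> = timestamps.toList()
}
```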
  • obtaining the first mapping relationship based on the second correspondence and the fourth text conversion result includes: based on the second correspondence, determining the first number of characters of each of the at least one first text segment in the fourth text conversion result; based on the arrangement orders and timestamps in the second correspondence, determining the first timestamp of each of the at least one first audio segment in the audio data corresponding to the fourth text conversion result; and obtaining, based on the arrangement orders in the second correspondence, the first mapping relationships between the first timestamps and the first numbers of characters, wherein first timestamps and first numbers of characters at the same arrangement order are mapped to each other.
  • for example, the second correspondence includes: arrangement order 0 corresponds to timestamp 2, and arrangement order 1 corresponds to timestamp 3.
  • the fourth text conversion result here is a final result and can include 2 preset punctuation marks.
  • for example, the fourth text conversion result is "Hello! My name is Zhang San." The fourth text conversion result can then be split according to the preset punctuation marks to determine each first text segment and its first number of characters.
  • the first number of characters l0 of the first text segment with arrangement order 0 is 3 (that is, the number of characters in the text "Hello!"), and the first number of characters l1 of the first text segment with arrangement order 1 is 5 (that is, the number of characters in the text "My name is Zhang San."); the character counts refer to the original Chinese text, 你好！ and 我叫张三。.
  • the timestamp of the first audio segment with arrangement order 0 in the pre-conversion audio data of the fourth text conversion result is timestamp 2, and the timestamp of the first audio segment with arrangement order 1 is timestamp 3.
  • thus, the first mapping relationship between the first number of characters l0 of the 0th first text segment and timestamp 2 of the 0th first audio segment can be obtained, as well as the first mapping relationship between the first number of characters l1 of the 1st first text segment and timestamp 3 of the 1st first audio segment.
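  • The worked example above can be reproduced with the earlier sketches (buildMappings is an illustrative name; SentenceMapping and splitIntoSegments are reused from the previous sketches):

```kotlin
// Combine the second correspondence (index = arrangement order, value =
// timestamp) with the final result's per-sentence character counts to obtain
// the first mapping relationships.
fun buildMappings(secondCorrespondence: List<Long>, finalResult: String): List<SentenceMapping> =
    splitIntoSegments(finalResult).zip(secondCorrespondence) { segment, ts ->
        SentenceMapping(timestampMs = ts, charCount = segment.length)
    }

// With secondCorrespondence = [timestamp2, timestamp3] and finalResult =
// "你好！我叫张三。", this yields [(timestamp2, 3), (timestamp3, 5)],
// matching l0 = 3 and l1 = 5 above.
```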
  • in response to the received second user operation, based on the first information, updating the playback progress of the audio data to the first starting time point of the second audio segment and displaying the second text segment in a preset display mode includes: in response to the received second user operation, determining at least one first mapping relationship in the first information; based on the at least one first mapping relationship and the audio data, determining at least one third audio segment, wherein the third audio segment with the earliest first timestamp among the at least one third audio segment is the second audio segment; based on the at least one first mapping relationship and the text data, determining at least one third text segment, the second text segment including the at least one third text segment; based on the first information, updating the playback progress of the audio data to the first starting time point of the second audio segment; and displaying the second text segment in the text data in the preset display mode.
  • the at least one third text segment is the final text conversion result of the at least one third audio segment, and each third audio segment is a continuous audio segment in the audio data.
  • the second user operation includes a first operation on the text data, and the first operation includes at least one click position. Determining at least one first mapping relationship in the first information in response to the received second user operation includes: in response to the received first operation on the text data, based on the first information and the at least one click position, determining at least one second number of characters in the text data, each being the number of characters located before a respective click position; and based on the first information and the at least one second number of characters, determining at least one first mapping relationship in the first information.
  • this embodiment can be a forward fixed-point playback scenario. The user clicks at least one click position in the text data, and for each click position the corresponding first mapping relationship in the original recording data is determined; since there is at least one click position, at least one first mapping relationship is determined.
  • the click position can be a position between two characters in the text data, or a character; this application does not limit this.
  • when the first operation includes only one click position, one second audio segment to be played and one second text segment can be determined; therefore, the number of first mapping relationships is one.
  • when the first operation includes two click positions, the first operation is used to select at least two characters, and when the at least one third audio segment includes multiple third audio segments, the multiple third audio segments are audio segments with continuous playback time in the audio data.
  • for example, the first operation can be an operation of selecting at least two characters of the text data, such as a text selection operation. The starting position and ending position of the text selection operation are then the two click positions, and for each of the two click positions the respective first mapping relationship in the original recording data is determined. If the two determined first mapping relationships are the same, the user has selected at least two characters within a single sentence; a sentence here can be understood as one first text segment. If the two determined first mapping relationships are different, the number of the at least one third audio segment is multiple, and the at least one third audio segment may include the starting audio segment, the ending audio segment, and the continuous audio segments between the audio segments of the two text segments.
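  • A minimal sketch of this forward lookup, under the assumption that the click position is expressed as a character offset into the text data (the function name is illustrative):

```kotlin
// Accumulate per-sentence character counts until the click offset (the number
// of characters before the click position) is covered; the resulting index
// identifies the first mapping relationship, i.e. the sentence clicked on.
fun mappingIndexForClick(data: List<SentenceMapping>, clickOffset: Int): Int {
    var consumed = 0
    data.forEachIndexed { index, mapping ->
        consumed += mapping.charCount
        if (clickOffset < consumed) return index
    }
    return data.lastIndex // a click past the end falls in the last sentence
}
```

For a text selection operation, this lookup would be run once for the starting position and once for the ending position; equal indices mean the selection lies within one sentence.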
  • after the playback progress of the audio data is updated to the first starting time point of the second audio segment and the second text segment is displayed in the preset display mode, the method further includes: in response to a received third user operation, starting from the first starting time point of the second audio segment, playing the at least one third audio segment in sequence according to their respective first timestamps, from earliest to latest. This embodiment can play the at least one third audio segment corresponding to the user's forward fixed-point playback or reverse fixed-point playback.
  • when the first operation includes only one click position, the number of the at least one third audio segment is one, the number of the at least one third text segment is one, and the second audio segment is the same as the third audio segment. After the at least one third audio segment is played in sequence, the method further includes: when playback reaches the first end time point of the third audio segment, based on the first information, continuing to play the fourth audio segment whose second starting time point is the first end time point; and, when playback reaches the first end time point of the third audio segment, restoring the display mode of the third text segment to the original display mode and, based on the first information, updating the display mode of the fourth text segment corresponding to the fourth audio segment from the original display mode to the preset display mode.
  • in the forward fixed-point playback scenario, the click position can be a character or a position between two characters; the playback progress of the audio data can be adjusted to the starting time point of the third audio segment corresponding to the click position, and the third text segment corresponding to the third audio segment is displayed in the preset display mode.
  • when playback of the third audio segment ends, the next audio segment can continue to be played, the third text segment that has finished playing can be restored to its original display mode, and the text segment corresponding to the next audio segment can be displayed in the preset display mode, so that after the user performs one forward fixed-point playback operation, subsequent audio segments are automatically played from that fixed point.
  • when the first operation is used to select at least two characters, after the at least one third audio segment is played in sequence, the method further includes: when playback reaches the second end time point of the third audio segment with the latest first timestamp among the at least one third audio segment, pausing playback of the at least one third audio segment and restoring the display mode of the at least one third text segment to the original display mode.
  • that is, after the selected at least one third audio segment finishes playing, playback of subsequent audio segments in the audio data can be stopped, and the display mode of the at least one third text segment corresponding to the at least one third audio segment restored to the original display mode. This makes it convenient for the user to re-listen to the at least one third audio segment and perform editing and correction operations on the at least one third text segment, improving text proofreading efficiency.
  • the second user operation includes an adjustment operation on the playback progress of the audio data, and the adjustment operation includes a playback progress time of the audio data. Determining at least one first mapping relationship in the first information in response to the received second user operation includes: in response to the received adjustment operation on the playback progress of the audio data, based on the first information and the playback progress time, determining one first mapping relationship in the audio data; wherein the number of the at least one third audio segment is one, and the time range corresponding to the third audio segment includes the playback progress time, the time range being composed of the third starting time point and the third ending time point of the third audio segment.
  • this embodiment can be a reverse fixed-point playback scenario.
  • the user can adjust the playback progress bar of audio data that has already been recorded to change the playback progress of the audio.
  • based on the playback progress time the user adjusts to, the electronic device can use the original recording data to determine which audio segment, that is, which sentence in the audio data, the playback progress time belongs to, realizing reverse fixed-point playback of the audio.
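  • A minimal sketch of this reverse lookup, assuming each stored timestamp is the starting time point of its sentence (the function name is illustrative):

```kotlin
// The sentence whose time range [own start, next sentence's start) contains
// the playback progress time the user dragged to is the one to play and
// display in the preset display mode.
fun mappingIndexForTime(data: List<SentenceMapping>, progressMs: Long): Int {
    data.forEachIndexed { index, mapping ->
        val end = data.getOrNull(index + 1)?.timestampMs ?: Long.MAX_VALUE
        if (progressMs >= mapping.timestampMs && progressMs < end) return index
    }
    return 0 // progress before the first sentence defaults to the first segment
}
```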
  • embodiments of the present application provide an electronic device.
  • the electronic device includes a memory and a processor, the memory coupled to the processor; the memory stores program instructions that, when executed by the processor, cause the electronic device to execute the method in the first aspect or any implementation of the first aspect.
  • embodiments of the present application provide a computer-readable medium for storing a computer program. When the computer program is run on an electronic device, the electronic device is caused to execute the method in the first aspect or any implementation of the first aspect.
  • embodiments of the present application provide a chip, which includes one or more interface circuits and one or more processors; the interface circuit is used to receive a signal from the memory of the electronic device and send the signal to the processor, the signal including computer instructions stored in the memory. When the processor executes the computer instructions, the electronic device is caused to execute the method in the first aspect or any implementation of the first aspect.
  • embodiments of the present application provide a computer program product containing instructions. When the computer program product is run on a computer, the computer is caused to execute the method in the first aspect or any implementation of the first aspect.
  • Figure 1 is one of the structural schematic diagrams of an exemplary electronic device
  • Figure 2 is a schematic diagram of the software structure of an exemplary electronic device
  • Figure 3 is a schematic diagram of a recording-to-text interface in traditional technology
  • Figure 4a is a flow chart of an exemplary fixed-point playback method
  • Figure 4b is a flow chart of an exemplary fixed-point playback method
  • Figure 5 is a schematic diagram illustrating an application scenario of an electronic device
  • Figure 6a is a schematic diagram illustrating an audio-to-text process
  • Figure 6b is a schematic structural diagram of exemplary original recording data
  • Figure 7a is a schematic diagram illustrating an exemplary data processing process
  • Figure 7b is a schematic diagram of an exemplary data processing process
  • Figure 8a is a schematic diagram illustrating an application scenario of an electronic device
  • Figure 8b is a schematic diagram illustrating an application scenario of an electronic device
  • Figure 8c is a schematic diagram illustrating an application scenario of an electronic device
  • Figure 8d is a schematic diagram illustrating an application scenario of an electronic device
  • Figure 8e is a schematic diagram illustrating an application scenario of an electronic device
  • Figure 8f is a schematic diagram illustrating an application scenario of an electronic device
  • Figure 8g is a schematic diagram illustrating an application scenario of an electronic device
  • Figure 9 is a schematic structural diagram of a device provided by an embodiment of the present application.
  • "A and/or B" can mean three situations: A exists alone, A and B exist simultaneously, or B exists alone.
  • first and second in the description and claims of the embodiments of this application are used to distinguish different objects, rather than to describe a specific order of objects.
  • first target object, the second target object, etc. are used to distinguish different target objects, rather than to describe a specific order of the target objects.
  • multiple processing units refer to two or more processing units; multiple systems refer to two or more systems.
  • FIG. 1 shows a schematic structural diagram of an electronic device 100 .
  • the electronic device 100 shown in FIG. 1 is only an example of an electronic device.
  • the electronic device 100 may be a terminal, which may also be called a terminal device. The terminal may be a cellular phone, a tablet computer (pad), a wearable device, an Internet of Things device, etc.; this is not limited in this application.
  • the electronic device 100 may have more or fewer components than shown in the figure, may combine two or more components, or may have different component configurations.
  • the various components shown in Figure 1 may be implemented in hardware, software, or a combination of hardware and software including one or more signal processing and/or application specific integrated circuits.
  • the electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, etc.
  • the sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the processor 110 may include one or more processing units.
  • the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • different processing units can be independent devices or integrated in one or more processors.
  • the controller may be the nerve center and command center of the electronic device 100 .
  • the controller can generate operation control signals based on the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • the processor 110 may also be provided with a memory for storing instructions and data.
  • the memory in processor 110 is cache memory. This memory may hold instructions or data that have been recently used or recycled by processor 110 . If the processor 110 needs to use the instructions or data again, it can be called directly from the memory. Repeated access is avoided and the waiting time of the processor 110 is reduced, thus improving the efficiency of the system.
  • processor 110 may include one or more interfaces.
  • Interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the USB interface 130 is an interface that complies with the USB standard specification, and may be a Mini USB interface, a Micro USB interface, a USB Type C interface, etc.
  • the USB interface 130 can be used to connect a charger to charge the electronic device 100, and can also be used to transmit data between the electronic device 100 and peripheral devices. It can also be used to connect headphones to play audio through them. This interface can also be used to connect other electronic devices, such as AR devices, etc.
  • the interface connection relationships between the modules illustrated in the embodiments of the present application are only schematic illustrations and do not constitute a structural limitation of the electronic device 100 .
  • the electronic device 100 may also adopt different interface connection methods in the above embodiments, or a combination of multiple interface connection methods.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the charging management module 140 may receive charging input from the wired charger through the USB interface 130 .
  • the charging management module 140 may receive wireless charging input through the wireless charging coil of the electronic device 100 . While the charging management module 140 charges the battery 142, it can also provide power to the electronic device through the power management module 141.
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, internal memory 121, external memory, display screen 194, camera 193, wireless communication module 160, etc.
  • the power management module 141 can also be used to monitor battery capacity, battery cycle times, battery health status (leakage, impedance) and other parameters.
  • the power management module 141 may also be provided in the processor 110 .
  • the power management module 141 and the charging management module 140 may also be provided in the same device.
  • the wireless communication function of the electronic device 100 can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor and the baseband processor.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in the electronic device 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization. For example, antenna 1 can be multiplexed as the diversity antenna of the wireless LAN. In other embodiments, antennas may be used in conjunction with tuning switches.
  • the mobile communication module 150 can provide solutions for wireless communication including 2G/3G/4G/5G applied on the electronic device 100 .
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), etc.
  • the mobile communication module 150 can receive electromagnetic waves through the antenna 1, perform filtering, amplification and other processing on the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves through the antenna 1 for radiation.
  • at least part of the functional modules of the mobile communication module 150 may be disposed in the processor 110 .
  • at least part of the functional modules of the mobile communication module 150 and at least part of the modules of the processor 110 may be provided in the same device.
  • a modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low-frequency baseband signal to be sent into a medium-high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal.
  • the demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the application processor outputs sound signals through audio devices (not limited to speaker 170A, receiver 170B, etc.), or displays images or videos through display screen 194.
  • the modem processor may be a stand-alone device.
  • the modem processor may be independent of the processor 110 and may be provided in the same device as the mobile communication module 150 or other functional modules.
  • the wireless communication module 160 can provide solutions for wireless communication applied on the electronic device 100, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), etc.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110, frequency modulate it, amplify it, and convert it into electromagnetic waves through the antenna 2 for radiation.
  • the antenna 1 of the electronic device 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device 100 can communicate with the network and other devices through wireless communication technology.
  • the electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is an image processing microprocessor and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • the display screen 194 is used to display images, videos, etc.
  • Display 194 includes a display panel.
  • the display panel can use a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc.
  • the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the electronic device 100 can implement the shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
  • Camera 193 is used to capture still images or video.
  • the object passes through the lens to produce an optical image that is projected onto the photosensitive element.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function. Such as saving music, videos, etc. files in external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the processor 110 executes instructions stored in the internal memory 121 to execute various functional applications and data processing of the electronic device 100 .
  • the internal memory 121 may include a program storage area and a data storage area. The program storage area can store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function).
  • the storage data area may store data created during use of the electronic device 100 (such as audio data, phone book, etc.).
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, universal flash storage (UFS), etc.
  • the electronic device 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signals. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
  • the buttons 190 include a power button, a volume button, etc.
  • Key 190 may be a mechanical key. It can also be a touch button.
  • the electronic device 100 may receive key inputs and generate key signal inputs related to user settings and function control of the electronic device 100 .
  • the motor 191 can generate vibration prompts.
  • the motor 191 can be used for vibration prompts for incoming calls and can also be used for touch vibration feedback.
  • touch operations for different applications can correspond to different vibration feedback effects.
  • the motor 191 can also respond to different vibration feedback effects for touch operations in different areas of the display screen 194 .
  • Different application scenarios such as time reminders, receiving information, alarm clocks, games, etc.
  • the touch vibration feedback effect can also be customized.
  • the indicator 192 may be an indicator light, which may be used to indicate charging status, power changes, or may be used to indicate messages, missed calls, notifications, etc.
  • the SIM card interface 195 is used to connect a SIM card.
  • the SIM card can be connected to or separated from the electronic device 100 by inserting it into the SIM card interface 195 or pulling it out from the SIM card interface 195 .
  • the electronic device 100 can support 1 or N SIM card interfaces, where N is a positive integer greater than 1.
  • SIM card interface 195 can support Nano SIM card, Micro SIM card, SIM card, etc. Multiple cards can be inserted into the same SIM card interface 195 at the same time. The types of the plurality of cards may be the same or different.
  • the SIM card interface 195 is also compatible with different types of SIM cards.
  • the SIM card interface 195 is also compatible with external memory cards.
  • the electronic device 100 interacts with the network through the SIM card to implement functions such as calls and data communications.
  • the electronic device 100 can use an eSIM, that is, an embedded SIM card. The eSIM card can be embedded in the electronic device 100 and cannot be separated from it.
  • the software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiment of this application takes the Android system with a layered architecture as an example to illustrate the software structure of the electronic device 100 .
  • FIG. 2 is a software structure block diagram of the electronic device 100 according to the embodiment of the present application.
  • the layered architecture of the electronic device 100 divides the software into several layers, and each layer has clear roles and division of labor.
  • the layers communicate through software interfaces.
  • the Android system is divided into four layers, from top to bottom: application layer, application framework layer, Android runtime and system libraries, and kernel layer.
  • the application layer can include a series of application packages.
  • the application package can include camera, gallery, calendar, call, map, recorder, WLAN, Bluetooth, music, video, short message and other applications.
  • the application framework layer provides an application programming interface (API) and programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer can include a window manager, content provider, view system, phone manager, resource manager, notification manager, etc.
  • a window manager is used to manage window programs.
  • the window manager can obtain the display size, determine whether there is a status bar, lock the screen, capture the screen, etc.
  • Content providers are used to store and retrieve data and make this data accessible to applications.
  • Said data can include videos, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
  • the view system includes visual controls, such as controls that display text, controls that display pictures, etc.
  • a view system can be used to build applications.
  • the display interface can be composed of one or more views.
  • a display interface including a text message notification icon may include a view for displaying text and a view for displaying pictures.
  • the phone manager is used to provide communication functions of the electronic device 100, for example, call status management (connected, hung up, etc.).
  • the resource manager provides various resources to applications, such as localized strings, icons, pictures, layout files, video files, etc.
  • the notification manager allows applications to display notification information in the status bar, which can be used to convey notification-type messages and can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify download completion, message reminders, etc.
  • notifications from the notification manager can also appear in the status bar at the top of the system in the form of charts or scroll-bar text, such as notifications for applications running in the background, or appear on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a beep sounds, the electronic device vibrates, or the indicator light flashes.
  • the system library and runtime layer include system libraries and Android Runtime.
  • System libraries can include multiple functional modules. For example: surface manager (surface manager), media libraries (Media Libraries), 3D graphics processing library (for example: OpenGL ES), 2D graphics engine (for example: SGL), etc.
  • the 3D graphics library is used to implement three-dimensional graphics drawing, image rendering, composition and layer processing, etc.
  • the Android runtime includes core libraries and a virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.
  • the core library contains two parts: one part consists of the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and application framework layer run in virtual machines. The virtual machine executes the Java files of the application layer and application framework layer as binary files. The virtual machine is used to perform functions such as object life-cycle management, stack management, thread management, security and exception management, and garbage collection.
  • the surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as static image files, etc.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, composition, and layer processing.
  • 2D Graphics Engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display driver, camera driver, audio driver, and sensor driver.
  • the components included in the system framework layer, system library and runtime layer shown in Figure 2 do not constitute specific limitations on the electronic device 100.
  • the electronic device 100 may include more or fewer components than shown in the figures, or some components may be combined, some components may be separated, or some components may be arranged differently.
  • in the traditional technology, the cloud server uses AI (artificial intelligence) technology to convert the recording file into text and records the correspondence between the recording and the text.
  • the cloud server then sends the transcribed text and the correspondence between the recording and the text to the electronic device.
  • the user can operate the recorder application to play the recording, so that the display interface of the mobile phone is the display interface 101 .
  • the display interface 101 may include one or more controls, including but not limited to: the converted text 106 of the recording, the play/pause control 102, the current playback time control 105, the playback duration control 104, the playback progress control 103, etc.
  • the total duration of the recording is 1 minute as shown in the playback duration control 104 .
  • the double vertical line icon of the playback pause control 102 indicates that the recording is in the playing state.
  • the user clicks on the "color" character in the text 106 to locate the audio.
  • the mobile phone can respond to the user's click operation and locate the audio position corresponding to the text segment to which the "color" character belongs based on the above correspondence relationship.
  • the audio position here is 0 minutes and 30 seconds.
  • the mobile phone can move the playback progress control 103 to the position of 0 minutes and 30 seconds, and update the display content of the current playback time control 105 to the current playback time (or playback progress), which is 0 minutes and 30 seconds here.
  • the mobile phone can make the text segment to which the character clicked by the user belongs bold, deepen its color, or change its color, to distinguish it from the unselected text.
  • the text selected here is "The spring scenery in the park is intoxicating, with cuckoos hiding on the branches of the mango trees, and groups of thrushes squatting on the branches of the poplar trees like a wedding team.".
  • the 0th minute and 30th second in the recording is the start playback time of the selected text.
  • the mobile phone can locate the audio position based on the user's click operation in the text converted to the recording, play the recording corresponding to the selected text, and guide the user to modify the text.
  • in the above solution, the recording needs to be uploaded to the cloud server after the recording is completed. Taking a recording file of about 5 minutes as an example, it takes about 10 seconds to upload it to the cloud and about 1 minute and 40 seconds to transcribe it into editable text. The longer the recording, the longer it takes the cloud server to convert the recording into text, which affects the efficiency of recording-to-text conversion.
  • in addition, the text segment located by the electronic device is the sentence to which the character selected by the user belongs (for example, a sentence ending with a period, exclamation mark, etc.; such a sentence may include multiple text fragments ending with commas), rather than the text fragment "The spring scenery in the park is intoxicating," to which the "color" character selected by the user belongs. Sentences are generally long, which is not convenient for users who want to modify the text based on the located audio.
  • the electronic device of this application provides a recording-to-text method, which can convert the recording into text locally on the electronic device, and supports forward fixed-point playback and reverse fixed-point playback of the recording, so that the user can, based on the fixed-point playback of the recording, perform text operations such as adding, deleting, and modifying the corresponding text content.
  • the modified text still supports the above-mentioned forward fixed-point playback and reverse fixed-point playback, which can improve the efficiency of recording to text and the efficiency of text proofreading.
  • forward fixed-point playback means that the user selects, in the transcribed text, the text segment that needs to be listened to again, and the electronic device can adjust the playback progress of the recording to the playback time of that text segment and play the recording.
  • reverse fixed-point playback means that after the user adjusts the playback progress of the recording (for example, by dragging the playback progress bar of the recording), the electronic device can locate the text segment corresponding to the playback progress in the text (the text converted from the recording) and play the recording.
  • FIG. 4a is a flowchart illustrating a fixed-point playback method.
  • the flow chart may include: S101, S103, and S105.
  • S101 The mobile phone obtains the original recording data of the audio.
  • the original recording data may include the text length L (for example, the number of characters) of the text segment in the audio-converted text and the timestamp T of the audio segment corresponding to the text segment.
  • the timestamp T may be the starting playback time or the end playback time of the audio clip.
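  • as a minimal sketch (not taken from this application's text; all names are illustrative assumptions), the original recording data can be modeled as an ordered list of per-sentence entries, each holding the character count L and the timestamp T:

        import java.util.ArrayList;
        import java.util.List;

        // Illustrative model of the original recording data produced in S101.
        final class SentenceRecord {
            final int textLength;   // L: number of characters of the text segment
            final long timestampMs; // T: start or end playback time of the audio segment

            SentenceRecord(int textLength, long timestampMs) {
                this.textLength = textLength;
                this.timestampMs = timestampMs;
            }
        }

        final class OriginalRecordingData {
            // Entry i corresponds to the sentence identified as index(i).
            final List<SentenceRecord> sentences = new ArrayList<>();
        }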
  • the method of this embodiment can be applied to audio-to-text scenarios in various languages such as Chinese, English, Japanese, Korean, etc.
  • This application uses Chinese as an example for explanation.
  • for other languages, the method is the same and will not be described again here.
  • FIG. 4b exemplarily shows the implementation process of S101
  • FIG. 5 is a schematic diagram of an application scenario of the electronic device.
  • the mobile phone is installed with an application, and the application integrates an audio recording (or collection) function and a text editing function.
  • This application adds a recording-to-text function and a fixed-point playback function to the application.
  • the application may be a recorder application (with audio recording, recording-to-text, text editing and fixed-point playback functions), or an instant messaging application (with audio recording, recording-to-text, text editing and fixed-point playback functions), etc.
  • This application does not limit this.
  • This method can also be applied to at least two applications.
  • the recorder application installed in the mobile phone can have audio recording functions and recording to text functions.
  • the memo application installed in the mobile phone can have text editing functions and fixed-point playback functions. The technical solution of this application can be implemented through interaction between the recorder application and the memo application; the specific implementation details are the same as those implemented using a single application, and will not be described again here.
  • the process may include the following steps:
  • S201 The mobile phone starts the audio collection and audio-to-text functions of the application.
  • the display interface 401 of the mobile phone includes one or more controls.
  • the controls include but are not limited to: power controls, network controls, and application icons.
  • the user can click the icon 402 of the audio recorder application in the display interface 401 to start the audio recorder application.
  • the display interface of the mobile phone is switched from display interface 401 to display interface 403.
  • the display interface 403 may include one or more controls.
  • the controls may include but are not limited to: a control 404 for searching for recording files, a control 406 for starting recording, an option control 405 for converting recording to text, etc.
  • Option control 405 includes switch control 4051.
  • the switch control 4051 is in the off state.
  • in this case, the recorder application can record in response to the user operation, but will not convert the audio recorded in real time into text during the recording process.
  • the recorder application can respond to the user operation by setting the switch control 4051 to the on state, as shown in Figure 5(3), so that the recorder application enables the audio-to-text function. Then, the user clicks the control 406 in Figure 5(3) to start recording, and the recorder application can start the audio collection function.
  • the recording interface 501 may include one or more controls.
  • the controls may include a recording progress control 506 and a current recording duration control 505.
  • the current recording duration control 505 shows that the duration of the recorded audio is 5 seconds.
  • the recording interface 501 may also include a mark control 502, which may be used to add a mark to a recording node of interest during the recording process, so that the user can locate the audio playback progress through the mark.
  • the recording interface 501 may also include a pause recording control 504 for pausing recording of currently recorded audio.
  • the recording interface 501 may also include an end recording control 503 to terminate this recording.
  • S202 The application converts the real-time collected audio into text.
  • the audio recorder application can collect audio in real time and convert the real-time collected audio into text.
  • the audio recorder application can convert the recorded audio into text during the recording process.
  • Any audio-to-text algorithm may be used to implement the voice recorder application's conversion of audio into text, and this application does not limit this.
  • S203 The application determines whether the text is a temporary result.
  • when the recorder application converts real-time recorded audio into text, it can use the audio-to-text algorithm to sequentially output multiple temporary results (the texts converted from the recording), where a temporary result output later is a correction and update of the temporary result output earlier.
  • the correction and update of the output temporary results can be understood as the correction of the converted text.
  • the temporary results before the update are no longer displayed; that is, at any moment, the recorder application outputs and displays only one temporary result. For example, the recorder application can use a temporary result output later to refresh and replace the temporary result output earlier.
  • after the recorder application refreshes and displays the temporary results multiple times, when it detects that the semantics of the most recently output temporary result are complete, it can output the most recently output temporary result as the final result (that is, as the text converted from an audio segment in the recorded audio).
  • the display mode of the most recently output temporary result can be changed to a display mode corresponding to the final result to remind the user that the current output result is the final result.
  • the most recently output temporary result can be refreshed as the final result according to the display mode of the final result.
  • the recorder application can clear the relevant data of multiple temporary results output in this round, and save a final result and its mapping relationship between L and T (see below for details).
  • the real-time collected audio continues to be converted into text, thereby outputting multiple temporary results and a final result in sequence again.
  • This cycle continues until the recording ends.
  • the recorder application has completed the text conversion of the recorded audio.
  • the converted text of the complete recording is composed of multiple final results.
  • the display content of the voice recorder application interface will be multiple final results, excluding temporary results.
  • the two kinds of results are displayed in different ways: for example, the font size of the temporary results is smaller than the font size of the final results, the text of the temporary results is in italics while the text of the final results is upright, etc.; the specific display method is not limited.
  • the final result is a semantically complete text segment, which may include one or more punctuation marks.
  • the determination condition for semantically complete text is determined by an audio-to-text algorithm, which is not limited by this application.
  • the accuracy of the temporary results output by the audio recorder application can be correlated with the accuracy of the audio-to-text algorithm.
  • for example, the accuracy of the audio-to-text algorithm is a single character, so every time the recorder application converts the recorded audio into one more character (such as a Chinese character, a word, or a symbol), it will generate and output a temporary result.
  • the recording-to-text algorithm can add punctuation marks to the converted text when a pause is detected in the audio, and output the text with added punctuation marks as an updated temporary result.
  • for example, the text corresponding to the user's voice is "hello!".
  • the temporary results output in sequence are: "you", "hello", "hello!".
  • the last output temporary result "hello!" is semantically complete, so the final result is "hello!".
  • the text displayed on the recording interface of the voice recorder application is sequentially updated to: "you", "hello", "hello!"; after a temporary result is output, the next output temporary result refreshes it.
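  • purely as an illustration of this refresh-then-finalize behavior (the loop shape is an assumption for illustration, not this application's implementation), the stream of temporary results above could be processed as follows:

        import java.util.List;

        public class TranscriptStreamDemo {
            public static void main(String[] args) {
                // The sequence of temporary results from the example above.
                List<String> temporaries = List.of("you", "hello", "hello!");
                String display = "";
                for (String t : temporaries) {
                    display = t; // a later temporary result refreshes the earlier one
                    System.out.println("refresh: " + display);
                }
                // Once the audio-to-text algorithm judges the semantics complete,
                // the last temporary result is promoted to a final result.
                String finalResult = display;
                System.out.println("final: " + finalResult);
            }
        }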
  • the recorder application can display the temporary results and final results obtained in real time on the recording interface 501 shown in Figure 5(4) and Figure 5(5), for example, in an area below the current recording duration control 505; this application does not limit this.
  • the recording interface 501 shows a final result that has already been converted, and a temporary result that is output after that final result.
  • the text content of the final result here is the text shown in the larger dotted box in the recording interface 501: "Recently, the reporter saw a group of children in Qinghu Park.".
  • the temporary result is the text "In the teacher" shown in the smaller dotted box in the recording interface 501 .
  • the dotted boxes here are only used to point out the temporary result and the final result; when the recording interface 501 actually displays a temporary result or a final result, the dotted boxes shown here are not displayed.
  • the font size of the temporary result is smaller than the font size of the final result, and the temporary result is in italics, so that the user can distinguish which text in the currently output converted text is the final converted text (the final result) and which text is an intermediate conversion result (the temporary result).
  • when the temporary results and the final results are displayed on the recording interface 501, they can be displayed in different display modes so that the user can distinguish them. In this way, during the recording process, the user can browse, in real time on the recording interface, the text content converted from the recorded audio.
  • the accuracy of the audio-to-text algorithm in this embodiment is a single character.
  • after the final result shown in the recording interface 501 is output, temporary result 0, temporary result 1, and temporary result 2 are output in sequence.
  • the text corresponding to temporary result 0 is "in".
  • the text corresponding to temporary result 1 is "in the".
  • the text corresponding to temporary result 2 is "in the teacher" (shown in the smaller dotted box in the recording interface 501).
  • as mentioned above, the temporary results output earlier are no longer displayed; therefore, only the text content corresponding to temporary result 2 is shown in Figure 5(4).
  • every time the recorder application updates a temporary result, the temporary result displayed in the recording interface 501 is refreshed accordingly.
  • the recorder application can update the last output temporary result to the final result.
  • the recorder application has output multiple final results on the recording interface 501.
  • for the text of the multiple final results, refer to the text content within the dotted box shown in Figure 5(5); it will not be described again here.
  • the dotted box in Figure 5(5) is only used to illustrate multiple final results, and in actual application, the dotted box is not displayed.
  • S204 The application records the timestamp T of each audio clip in real time based on the punctuation marks in the temporary results.
  • sentence-breaking punctuation marks are preset punctuation marks used to indicate sentence breaking.
  • sentence-breaking punctuation marks may include, but are not limited to: the comma, period, exclamation mark, question mark, semicolon, enumeration comma, etc. Symbols such as colons and parentheses do not carry a sentence-breaking meaning and may not be used as sentence-breaking punctuation marks.
  • the timestamp T can also be the current recording duration corresponding to the starting time point or the ending time point of the audio segment.
  • for example, if an audio segment is the fragment recorded from the 1st second to the 10th second of the currently recorded audio, then the timestamp T of the audio segment can be 1 second or 10 seconds.
  • alternatively, the timestamp T may be the system time of the mobile phone corresponding to the starting time point or the ending time point of the audio clip, such as Beijing time. For example, if an audio clip is recorded from 14:00:00 to 14:00:10 Beijing time, the timestamp of the audio clip can be 14:00:00 Beijing time, or 14:00:10 Beijing time.
  • the timestamp T may also be other types of time information used to identify the recording progress of the audio clip, and is not limited to the current recording duration or system time as exemplified above.
  • the output time of each temporary result is the current time of audio collection.
  • when detecting that a temporary result includes a sentence-breaking punctuation mark, the recorder application can obtain the current time of audio collection (such as the system time corresponding to the starting time point or the ending time point of the above-mentioned audio clip, or the current recording duration) to obtain the timestamp corresponding to the sentence-breaking punctuation mark in the temporary result.
  • FIG. 6a is a schematic diagram illustrating multiple temporary results and timestamps output in sequence.
  • Pi represents the i-th punctuation mark in a temporary result
  • ti represents the timestamp T of Pi.
  • i is an integer starting from 0, and there is no limit to the maximum value of i.
  • the text of each temporary result is not shown; each temporary result is represented here by a line segment.
  • the sentence punctuation marks in the temporary results are shown by black dots.
  • the arrows pointing to the black dots illustrate the sentence punctuation marks.
  • the timestamp ti in each temporary result in Figure 6a represents the timestamp of the i-th sentence punctuation mark in the temporary result.
  • Figure 6a(1) and Figure 6a(4) both include timestamp t0.
  • the value of timestamp t0 may be different in each; timestamp t0 is only used to represent the timestamp of the 0th sentence-breaking punctuation mark in a temporary result.
  • when sorting objects here, the sorting starts from the 0th one; this is a sorting convention of computer languages. For example, the 0th sentence-breaking punctuation mark in computer-language terms is the 1st sentence-breaking punctuation mark in natural-language terms.
  • the recorder application detects that the temporary result 0 includes a sentence punctuation mark P0.
  • the recorder application outputs temporary result 1.
  • next, suppose the recorder application detects that the temporary result 1 includes two sentence-breaking punctuation marks (for example, the symbol corresponding to P0 is a comma, and the symbol corresponding to P1 is a period). The recorder application can compare the temporary result 1 output this time with the temporary result output last time (here, temporary result 0) and detect that the number of sentence-breaking punctuation marks has changed and has increased. The recorder application can then obtain the current recording duration t1 (for example, 2s) as the timestamp of the last sentence-breaking punctuation mark in the temporary result (here, P1), and continue to record the mapping relationship between P1 and t1.
  • the updated mapping relationship includes P0 corresponding to 1s and P1 corresponding to 2s.
  • the recorder application outputs temporary result 2.
  • suppose the recorder application detects that the temporary result 2 includes one sentence-breaking punctuation mark (for example, the symbol corresponding to P0 is a comma); the recorder application can then detect that, compared with the temporary result 1 output last time, the number of sentence-breaking punctuation marks in the temporary result 2 output this time has decreased.
  • the recorded mapping relationship about the temporary results includes P0 corresponding to t0 and P1 corresponding to t1. Then the recorder application can delete the mapping relationship corresponding to the last punctuation mark recorded, that is, the mapping relationship between P1 and t1 here.
  • the scene changing from Figure 6a(1) to Figure 6a(2) may be that the audio-to-text algorithm determines that a sentence punctuation mark should be added to the text after P0, thereby adding P1.
  • Another example is the scene that changes from Figure 6a(2) to Figure 6a(3).
  • for example, the audio-to-text algorithm may determine that there should be no sentence break at the position of P1, so the symbol corresponding to P1 in the temporary result 1 is deleted and the temporary result 2 is output.
  • the recorder application can obtain the current recording duration (for example, 2.5s) as the timestamp t1 of the last sentence-breaking punctuation mark in this temporary result 3 to continue recording.
  • the updated mapping relationship includes P0 corresponding to 2.1s and P1 corresponding to 2.5s.
  • the timestamp recorded for the last sentence-breaking punctuation mark (here, a comma) in temporary result 3 is 2.5s.
  • however, the timestamp actually corresponding to that comma is t0, namely 2.1s; recording it as 2.5s in the mapping relationship corresponding to temporary result 3 leads to a certain error in the timestamps of some sentence-breaking punctuation marks in the updated temporary results.
  • this occurs only in the rare case where the sentence-breaking punctuation mark added in the temporary result output this time is located before the last punctuation mark of the temporary result output last time, and the error does not affect the overall audio-to-text conversion or the overall accuracy of the timestamps.
  • after the recorder application outputs the temporary result 3, it continues to output the temporary result 4 that corrects the temporary result 3. If the recorder application detects that the temporary result 4 includes one sentence-breaking punctuation mark (such as a comma), it can detect that the temporary result 4 output this time has fewer sentence-breaking punctuation marks than the temporary result 3 output last time. Before the temporary result 4 is output, the recorded mapping relationship for the temporary results includes P0 corresponding to 2.1s and P1 corresponding to 2.5s.
  • the recorder application can then delete the mapping relationship recorded for the last punctuation mark, that is, delete the mapping relationship between P1 and t1 (here, 2.5s) in Figure 6a(4).
  • the recorder application can also obtain the current recording duration (for example, 2.55s) and update the value of t0 corresponding to P0 in the mapping relationship shown in Figure 6a(4) to the current recording duration; then, as shown in Figure 6a(5), the updated mapping relationship includes P0 corresponding to t0 (here, 2.55s).
  • the recorder application continues to output the next temporary result, temporary result 5 as shown in Figure 6a(6).
  • if the recorder application detects that the temporary result 5 includes 2 sentence-breaking punctuation marks (for example, the symbol corresponding to P0 is a comma and the symbol corresponding to P1 is a period), the recorder application can detect that the number of sentence-breaking punctuation marks in the temporary result 5 output this time has increased compared with the temporary result 4 output last time, and the recorder application can obtain the current recording duration as the timestamp of the last sentence-breaking punctuation mark.
  • the updated mapping relationship includes P0 corresponding to 2.55s and P1 corresponding to 2.6s.
  • when the number of sentence-breaking punctuation marks does not change between two temporary results, the acquisition of the timestamp is not triggered, nor is the update of the mapping relationship between Pi and ti triggered.
  • to summarize, every time the recorder application outputs a temporary result, it can detect whether the temporary result includes sentence-breaking punctuation marks. If so, the recorder application can obtain the number of sentence-breaking punctuation marks, and, when the number has increased compared with the temporary result output before this one, obtain the timestamp (such as the current recording duration) as the timestamp of the last sentence-breaking punctuation mark in the temporary result output this time.
  • in this way, the timestamps corresponding to the sentence-breaking punctuation marks can be updated according to the arrangement order of the sentence-breaking punctuation marks.
  • when the number of sentence-breaking punctuation marks has decreased, the recorder application can also obtain the timestamp (such as the current recording duration), delete, from the recorded mapping relationship between the arrangement order of the sentence-breaking punctuation marks and the timestamps, the mapping relationship between the last sentence-breaking punctuation mark in the order and its timestamp, and, after the deletion, write the timestamp obtained this time into the mapping relationship as the timestamp corresponding to the now-last sentence-breaking punctuation mark (that is, the second-to-last sentence-breaking punctuation mark before the deletion).
  • in this embodiment, the accuracy of the audio-to-text algorithm is a single character; that is, a temporary result is output every time a character is updated, and a sentence-breaking punctuation mark is also a character. Between two consecutive temporary results, when the number of sentence-breaking punctuation marks changes, it generally increases or decreases by only one. Therefore, whenever one sentence-breaking punctuation mark is added or removed in the converted text, the recorder application can output a temporary result, and the position of the sentence-breaking punctuation mark that is added or removed in this temporary result is generally the position of the last sentence-breaking punctuation mark of the last output temporary result or a position after it.
  • Figures 6a(1) to 6a(6) are intended to reflect the main scenarios in which the number of sentence-breaking punctuation marks is updated; in practical applications, the increase and decrease scenarios represented by Figures 6a(1) to 6a(6) do not necessarily occur consecutively. Figure 6a is only used as an example to illustrate how the recorder application of this application updates the timestamps of the corresponding sentence-breaking punctuation marks in the temporary results when the number of sentence-breaking punctuation marks changes; it is not intended to limit this application.
  • suppose the recorder application then detects that the semantics of the temporary result 5 are complete; as shown in Figure 6a(7), the recorder application can output the temporary result 5 as the final result.
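  • the S204 bookkeeping described above can be summarized by the following sketch (the punctuation set, data structures, and method names are assumptions for illustration, not this application's implementation):

        import java.util.ArrayList;
        import java.util.List;

        public class PunctuationTimestampTracker {
            // Assumed set of sentence-breaking punctuation marks (ASCII and full-width).
            private static final String BREAKS = ",.!?;、，。！？；";
            private final List<Long> timestamps = new ArrayList<>(); // index i holds ti for Pi

            static int countBreaks(String text) {
                int n = 0;
                for (int i = 0; i < text.length(); i++) {
                    if (BREAKS.indexOf(text.charAt(i)) >= 0) n++;
                }
                return n;
            }

            /** Call once per temporary result; nowMs is the current recording duration. */
            void onTemporaryResult(String text, long nowMs) {
                int count = countBreaks(text);
                if (count > timestamps.size()) {
                    // A mark was added: record nowMs as the timestamp of the last mark.
                    while (timestamps.size() < count) timestamps.add(nowMs);
                } else if (count < timestamps.size()) {
                    // A mark was removed: drop the last mapping, then refresh the
                    // timestamp of the mark that is now last (per Figure 6a(4) to 6a(5)).
                    while (timestamps.size() > count) timestamps.remove(timestamps.size() - 1);
                    if (!timestamps.isEmpty()) timestamps.set(timestamps.size() - 1, nowMs);
                }
                // Equal count: neither the timestamp nor the mapping is updated.
            }

            List<Long> snapshot() { return new ArrayList<>(timestamps); }
        }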
  • S205 The application records the timestamp T of each audio segment and the text length L of the text segment corresponding to the audio segment based on the final result.
  • the mapping relationship between Pi and ti corresponding to the latest temporary result can be used as the mapping relationship between Pi and ti of the final result; as shown in Figure 6a(6), P0 corresponds to t0 (for example, 2.55s) and P1 corresponds to t1 (for example, 2.6s).
  • the recorder application can then record the number of characters of each text fragment based on the position of Pi in the final result in the mapping relationship between Pi and ti.
  • for example, the recorder application can calculate the number of characters l0 from the starting position of the final result to P0 corresponding to timestamp t0 (that is, the 1st sentence-breaking punctuation mark in natural-language terms; the corresponding symbol here is a comma), where the comma corresponding to P0 can be counted within the number of characters l0.
  • similarly, the recorder application can calculate the number of characters l1 from P0 to P1 (including P1) in the final result, thereby generating the mapping relationship between L and T of each text segment (or each audio segment) in the recording-to-text conversion shown in Figure 6a(8); the mapping relationship includes the mapping between l0 and t0 and the mapping between l1 and t1.
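  • a minimal sketch of this S205 step (the punctuation set and names are illustrative assumptions): walk the final result once and emit the character count li of each sentence, counting the sentence-breaking punctuation mark itself within li:

        import java.util.ArrayList;
        import java.util.List;

        public class FinalResultMapper {
            // Assumed set of sentence-breaking punctuation marks.
            private static final String BREAKS = ",.!?;、，。！？；";

            static List<int[]> lengthsPerSentence(String finalResult) {
                List<int[]> out = new ArrayList<>(); // each entry: {sentenceIndex, li}
                int start = 0, index = 0;
                for (int i = 0; i < finalResult.length(); i++) {
                    if (BREAKS.indexOf(finalResult.charAt(i)) >= 0) {
                        out.add(new int[] {index++, i - start + 1}); // mark counted in li
                        start = i + 1;
                    }
                }
                return out;
            }

            public static void main(String[] args) {
                // "hello, world." -> l0 = 6 ("hello,"), l1 = 7 (" world.")
                for (int[] e : lengthsPerSentence("hello, world.")) {
                    System.out.println("l" + e[0] + " = " + e[1]);
                }
            }
        }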
  • the "first sentence" here denotes the first text fragment, in the text converted from the audio, that includes a sentence-breaking punctuation mark.
  • each subsequent "text fragment" denotes a text fragment after the first sentence that includes a sentence-breaking punctuation mark.
  • Figure 6b is an exemplary illustration of the original recording data generated after converting the audio to text, that is, the mapping relationship between L and T of each sentence.
  • L of the first sentence (for example, a text fragment) is l0, which is the number of characters of the first sentence.
  • the timestamp of the audio clip corresponding to the first sentence is t0, where t0 is the end time of the audio clip corresponding to the first sentence.
  • T of an audio segment can also be set as the starting time of the audio segment.
  • in this case, L of the first sentence is l0 and its T is 0; L of the second sentence is l1 and its T is t0.
  • when the recorder application detects that a sentence-breaking punctuation mark appears in a temporary result, it can record the timestamp of the punctuation mark in real time, thereby recording the timestamp of the text segment that includes the punctuation mark.
  • when the final result is output, the timestamps recorded for the latest temporary result are dumped, and the number of characters of each sentence in the final result is recorded. The timestamps generated from the temporary results are accurate, and the character counts generated from the final result are accurate; therefore, based on the temporary results and the final results, the mapping relationship between the timestamp T of each audio segment and the number of characters L of the corresponding text fragment in the recording-to-text conversion can be obtained.
  • the recorder application can persistently store the mapping relationship between ti and li corresponding to the final result to the local file system after each output of the final result.
  • alternatively, the mapping relationships between ti and li corresponding to the final results can be persistently stored in the local file system after the recording ends; this application does not limit this.
  • the mobile phone can generate original recording data while converting the recording into text.
  • the mobile phone can segment text fragments according to sentence punctuation marks, and record the mapping relationship between the timestamp T and the number of characters L of each text fragment (such as each sentence) to generate original record data.
  • the original recorded data has information about the timestamp T and the number of characters L, which can be used as a basis for calculating the fixed-point playback position when the mobile phone performs fixed-point playback, thereby achieving accurate fixed-point playback.
  • in the method of this embodiment, the recording process and the process of converting the recording into text are carried out simultaneously, and the audio recording and text editing functions are integrated in the same application, which realizes conversion while recording: when the recording ends, the conversion of the recording into text also ends. This eliminates the need to convert the recording into text after the recording ends, making the conversion of recording to text more efficient.
  • one set of temporary results corresponds to one final result.
  • multiple sets of temporary results and corresponding multiple final results can be obtained.
  • the data between different sets of temporary results are independent of each other.
  • the data between different final results are independent of each other. Therefore, when comparing whether the number of sentence-breaking punctuation marks has changed between two temporary results, the comparison is made within the group of temporary results corresponding to the same final result, without comparing against the temporary results of other groups.
  • S206 The application determines whether an operation to stop audio collection is received.
  • the recorder application can determine whether the recording has ended. If the recording has not ended, then go to S201 and continue to perform the above steps in a loop until the recording ends.
  • the recorder application may execute S201 to S206 shown in Figure 4b in a loop.
  • otherwise, the recorder application can receive the operation of stopping audio collection, and then proceed to execute S207.
  • the display interface 403 may include one or more controls, which may include the recording result control 406.
  • the recording result control 406 may include a recording name control (the recording name here is "Recording 1"), a recording time control (here is March 1, 2022), and a playback recording control 4061.
  • the display interface 601 includes one or more controls.
  • the control may include the recording 1 shown in Figure 8a(1), the text 603 converted in real time during the real-time recording process, and the playback progress bar control 602.
  • the playback progress bar control 602 includes a playback progress bar 6023, a playback progress control 6024, a playback pause control 6025, a current playback time control 6021, and an audio duration control 6022.
  • the audio duration of recording 1 is 1 minute as shown by audio duration control 6022
  • the current playback progress of recording 1 is 0 minutes and 0 seconds as shown by current playback time control 6021.
  • the recording 1 is currently in the playing state.
  • S103 The mobile phone determines at least one fixed-point playback position based on the original recorded data.
  • the fixed-point playback operation can be divided into forward fixed-point playback and reverse fixed-point playback.
  • for forward fixed-point playback, the user can click a certain position in the text 603, or select half a sentence, a sentence, or multiple consecutive sentences; here, a sentence means text that includes a sentence-breaking punctuation mark.
  • the mobile phone can determine the fixed-point playback position based on the original record data (such as the original record data shown in Figure 6b), for example, which sentence in the original record data corresponds to the text clicked by the user.
  • Embodiment 1 The user clicks a single position in the text
  • the recorder application obtains the coordinate Q(x, y) of the position in the displayed text (for example, text 603 shown in Figure 8a(2)).
  • the recorder application obtains the total word count of the text before the above position, offsetCount, based on the coordinates Q(x,y).
  • based on the original recorded data, the recorder application calculates in a loop, in the order of the sentences in the original recorded data, the cumulative word count totalCount(i) = l0 + l1 + ... + li, up to totalCount(n-1) = l0 + l1 + l2 + l3 + ... + l(n-1).
  • the original recorded data shown in Figure 6b has a total of n sentences of L and T.
  • each time, the recorder application can determine whether the currently calculated totalCount(i) is greater than or equal to the above offsetCount. If the currently calculated totalCount(i) is less than offsetCount, it continues with the next calculation, totalCount(i+1). If the currently calculated totalCount(i) is greater than or equal to offsetCount, the sentence index(i) reached in the current calculation of totalCount(i) is determined to be the fixed-point playback position.
  • for example, the recorder application can determine that the user's click position is in the fourth sentence shown in Figure 6b, where the fourth sentence is identified as index(3); a minimal sketch of this calculation follows.
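  • the sketch below implements the cumulative-count lookup just described (the class and method names are illustrative assumptions):

        public class ForwardLocator {
            /** Returns index(i) of the sentence containing the clicked character. */
            static int locate(int[] lengths, int offsetCount) {
                int total = 0;
                for (int i = 0; i < lengths.length; i++) {
                    total += lengths[i];                // totalCount(i)
                    if (total >= offsetCount) return i; // first cumulative sum covering the click
                }
                return lengths.length - 1;              // clamp: click past the last sentence
            }

            public static void main(String[] args) {
                int[] lengths = {10, 12, 9, 15};        // l0..l3 from the original recording data
                System.out.println("index(" + locate(lengths, 33) + ")"); // 10+12+9=31 < 33 -> index(3)
            }
        }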
  • Embodiment 2 The user selects half a sentence, one sentence, or multiple consecutive sentences in the text converted from the audio.
  • the recorder application can obtain the coordinates Q1(x, y) and Q2(x, y) corresponding to the starting position and the ending position of the target text selected by the user in the text converted from the audio (for example, the text 603 shown in Figure 8a(2)).
  • the recorder application obtains the total word count of the text before the starting position, offsetCount1, based on the coordinates Q1 (x, y); and obtains the total word count of the text before the end position, offsetCount2, based on the coordinates Q2 (x, y).
  • based on offsetCount1 and the original recording data, the recorder application determines the fixed-point playback position 1 (for example, index(i)) corresponding to the starting position; and based on offsetCount2 and the original recording data, determines the fixed-point playback position 2 (for example, index(j)) corresponding to the ending position.
  • in this way, the recorder application can determine the indexes of the start sentence and the end sentence in the original recorded data, thereby determining that the starting position belongs to the (i+1)-th sentence in the text converted from the recording and the ending position belongs to the (j+1)-th sentence, as the short example below shows.
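  • under the same assumptions, the range lookup can simply reuse the ForwardLocator.locate() sketch above, once per end of the selection (assuming both classes sit in the same package):

        public class RangeLocatorDemo {
            public static void main(String[] args) {
                int[] lengths = {10, 12, 9, 15};          // l0..l3
                int offsetCount1 = 12, offsetCount2 = 40; // words before start / end positions
                int i = ForwardLocator.locate(lengths, offsetCount1); // start -> index(1)
                int j = ForwardLocator.locate(lengths, offsetCount2); // end   -> index(3)
                System.out.println("selection covers index(" + i + ") .. index(" + j + ")");
            }
        }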
  • the user can click a certain position in the playback progress bar 6023, or drag the playback progress control 6024 to the position to change the playback progress of the recording.
  • the mobile phone can respond to the user operation and determine the fixed-point playback position based on the original record data (such as the original record data shown in Figure 6b).
  • that is, the mobile phone determines which sentence in the original record data the audio playback position selected by the user in the playback progress bar 6023 corresponds to.
  • the recorder application obtains the current playback time progressTime corresponding to the user's adjusted playback progress of recording 1.
  • the user clicks a certain position in the playback progress bar 6023 shown in Figure 8a(2), or drags the playback progress control 6024 to the position.
  • the recorder application can obtain the current playback time progressTime corresponding to the position in response to the user operation.
  • the recorder application traverses the ti of each sentence based on the original recorded data, in the order of the sentences in the original recorded data, where ti is the end playback time of the (i+1)-th sentence; for example, as shown in Figure 6b, i can start from 0 and its maximum value is (n-1).
  • when the first ti that is greater than or equal to progressTime is found, the corresponding index(i) is, for example, index(3), where index(3) is the identifier of the fourth sentence, and the fixed-point playback position can be determined to be the fourth sentence; a sketch of this traversal follows.
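  • a minimal sketch of this reverse lookup, assuming the ti are end-playback times in milliseconds (the names are illustrative, not this application's implementation):

        public class ReverseLocator {
            /** timestampsMs[i] = ti, the end playback time of sentence i+1 (ms). */
            static int locate(long[] timestampsMs, long progressTimeMs) {
                for (int i = 0; i < timestampsMs.length; i++) {
                    if (timestampsMs[i] >= progressTimeMs) return i; // index(i)
                }
                return timestampsMs.length - 1; // progress past the last sentence: clamp
            }

            public static void main(String[] args) {
                long[] t = {2_000, 9_000, 20_000, 25_000}; // t0..t3, Figure 6b-style data
                System.out.println("index(" + locate(t, 21_000) + ")"); // falls in the 4th sentence
            }
        }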
  • S105 The mobile phone updates the audio playback progress and identifies the target text segment in the text based on the at least one fixed-point playback position and the original recorded data.
  • the at least one fixed-point playback position is the identifier index(i) of a sentence determined in the original recorded data
  • the mobile phone can move the playback progress control 6024 along the playback progress bar 6023.
  • the current playback time corresponding to the moved position of the playback progress control 6024 is t2.
  • the time shown by the current playback time control 6021 will be updated to t2.
  • the mobile phone can also display the fourth sentence in text 603 in a preset display mode.
  • the preset display mode may be different from the display mode of other texts in the text 603 except the fourth sentence.
  • the target text fragment here is "Feel the picturesque spring with white flowers in full bloom.".
  • the difference between the preset display mode and the original display mode of the fourth sentence may lie in at least one of font size, font, font color, font shadow, font background color, etc.; this application does not limit this.
  • for example, the preset display mode can be that the font background color is blue.
  • when the user selects a piece of text, the at least one fixed-point playback position includes an identifier index(i) and an identifier index(j).
  • the mobile phone can move the playback progress control 6024 along the playback progress bar 6023.
  • the current playback time corresponding to the moved position of the playback progress control 6024 is t1.
  • the time shown by the current playback time control 6021 will be updated to t1.
  • the mobile phone can also display the third sentence and the fourth sentence in text 603 in the preset display mode.
  • the third sentence here includes "Use young hands to draw the park scenery," and the fourth sentence includes "Feel the picturesque spring with white flowers blooming.".
  • the preset display mode may be different from the display mode of other texts in the text 603 except the third sentence and the fourth sentence.
  • In Embodiment 3, when the mobile phone executes S105, the implementation is the same as in Embodiment 1 above; for details, refer to the specific description of implementing S105 in Embodiment 1, which will not be repeated here.
  • in this way, the mobile phone can convert the recorded audio into text in real time during the recording process, and generate the above-mentioned original recording data in real time.
  • the original recording data can include the L (number of characters) and the timestamp T (such as the start playback time or the end playback time) of each sentence.
  • the mobile phone can determine the fixed-point playback position based on the position of the fixed-point operation and the original recorded data, for example, which sentence in the original recorded data the position of the user's fixed-point operation belongs to. In this way, the L and T of that sentence and of the preceding sentences in the original recorded data are combined to perform fixed-point playback of the audio and highlight the located text, thereby realizing forward fixed-point playback and reverse fixed-point playback of the recorded audio; the sketch below combines both lookups.
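  • as a minimal illustration (structures and names assumed, not this application's implementation): for a located sentence index(i) with end-playback times ti, the seek position is the preceding sentence's end time t(i-1) (0 for the first sentence), and the highlight range follows from the cumulative character counts:

        public class FixedPointPlayback {
            static void playAndHighlight(int[] lengths, long[] endTimesMs, int index) {
                long seekMs = (index == 0) ? 0 : endTimesMs[index - 1]; // start of sentence index
                int startChar = 0;
                for (int k = 0; k < index; k++) startChar += lengths[k];
                int endChar = startChar + lengths[index];
                System.out.println("seek to " + seekMs + " ms, highlight chars ["
                        + startChar + ", " + endChar + ")");
            }

            public static void main(String[] args) {
                int[] lengths = {10, 12, 9, 15};             // l0..l3
                long[] endTimes = {2_000, 9_000, 20_000, 25_000}; // t0..t3
                playAndHighlight(lengths, endTimes, 3);      // the fourth sentence, index(3)
            }
        }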
  • the recorder application in the above embodiment may include an audio control module, a text view module, and a fixed-point playback module.
  • Figure 7a is a schematic diagram of the data processing process of the recorder application in an exemplary forward fixed-point playback scenario. It can be understood in conjunction with the scene diagrams of Figure 6b and Figures 8a to 8f.
  • the process may include the following steps:
  • the audio control module obtains the original recording data of the recording.
  • the specific implementation process of this step may refer to the descriptions of the related embodiments in FIG. 4b, FIG. 5, and FIG. 6a and FIG. 6b, which will not be described again here.
  • the text view module responds to the fixed-point playback operation of the text converted from the recording and obtains the total number of words of the text before the click position according to the coordinates of the click position.
  • after the mobile phone displays the display interface 601 shown in Figure 8a(2) and plays the recording for a period of time, the mobile phone can display the display interface 601 shown in Figure 8b(1), so that the current playback time shown by the current playback time control 6021 is updated to 0 minutes and 5 seconds, and the position of the playback progress control 6024 on the playback progress bar 6023 moves accordingly.
  • the user can perform a click operation on the text 603; when the text view module then executes S303, it can be implemented through the above-mentioned Embodiment 1.
  • for the specific implementation process, please refer to the introduction to Embodiment 1.
  • for example, the user reads the text 603 shown in Figure 8b(1), finds a place where a sentence is not smooth, and clicks that position with a single finger. For example, the user clicks position 1 in the text 603, where the text corresponding to position 1 is "white". The user's click operation can generate a click event, and the text view module can process the click event to obtain the coordinates (x1, y1) of position 1. Then, the text view module can obtain, according to the coordinates (x1, y1), the total number of words offsetCount of the text before position 1 in the text 603.
  • the user can select half a sentence, a sentence, or multiple consecutive sentences for the text 603.
  • when the text view module executes S303 in this case, it can be implemented through the above-mentioned Embodiment 2.
  • for the specific implementation process, please refer to the introduction of Embodiment 2.
  • FIG. 8c is a schematic diagram illustrating a scenario in which the user selects half a sentence of the text 603.
  • the implementation process of the scene in which the user selects a sentence in the text 603 is similar to the process of the scene in Figure 8c, and will not be described again here.
  • the mobile phone can display the display interface 601 as shown in Figure 8c(1).
  • the user can select text 1 (here, "Blooming scenery is picturesque"); the position of the starting character in text 1 is position 2, where the character "SHENG" is located, and the position of the ending character is position 3, where the character "HUA" is located.
  • the user's operation of selecting text can generate two click events, namely the click event on position 2 and the click event on position 3.
  • the text view module can handle the two click events to obtain the coordinates of position 2 (x2, y2) and the coordinates of position 3 (x3, y3).
  • the text view module can obtain the total word count offsetCount1 of the text before position 2 in text 603 according to the coordinates (x2, y2), and obtain the total word count offsetCount2 of the text before position 3 in text 603 according to the coordinates (x3, y3).
  • FIG. 8d is a schematic diagram illustrating a scenario in which the user selects multiple consecutive sentences on the text 603.
  • the mobile phone can display the display interface 601 as shown in Figure 8d(1), and the user can select text 2 (for the specific text content, refer to text 2 shown in Figure 8d(1); it will not be described here); the position of the starting character of text 2 is position 4, and the position of the ending character is position 5. The user's operation of selecting text can generate two click events, namely a click event on position 4 and a click event on position 5. The text view module can then handle the two click events to obtain the coordinates (x4, y4) of position 4 and the coordinates (x5, y5) of position 5.
  • the text view module can obtain the total word count offsetCount3 of the text before position 4 in text 603 according to the coordinates (x4, y4), and obtain the total word count offsetCount4 of the text before position 5 in text 603 according to the coordinates (x5, y5).
  • the fixed-point playback module performs fixed-point calculation based on the total word count of the text and the original recorded data, and determines at least one fixed-point playback position.
  • based on the total word count offsetCount and the original recorded data, the fixed-point playback module can determine that position 1 in Figure 8b(1) belongs to the fourth sentence in text 603; that is, the fixed-point playback position is index(3), where index(3) is the identifier of the fourth sentence.
  • the user can select half a sentence, one sentence, or multiple consecutive sentences for the text 603.
  • when the fixed-point playback module executes S305 in this case, it can be implemented through the above-mentioned Embodiment 2.
  • for the specific implementation process, please refer to the introduction to Embodiment 2.
  • based on the total word count offsetCount1 and the total word count offsetCount2 mentioned in the example of Figure 8c(1), the fixed-point playback module can determine that position 2 and position 3 in Figure 8c(1) both belong to the fourth sentence in text 603; that is, the fixed-point playback position is index(3), where index(3) is the identifier of the fourth sentence.
  • based on the total word count offsetCount3 mentioned in the example of Figure 8d(1), the fixed-point playback module can determine that position 4 in Figure 8d(1) belongs to the second sentence in the text 603; then one fixed-point playback position is index(1), where index(1) is the identifier of the second sentence in the original recorded data. Based on the total word count offsetCount4 mentioned in the example of Figure 8d(1), the fixed-point playback module can determine that position 5 in Figure 8d(1) belongs to the fifth sentence in the text 603; then another fixed-point playback position is index(4), where index(4) is the identifier of the fifth sentence in the original record data.
  • the fixed-point playback module updates the playback progress and identifies the target text segment in the text.
  • based on the original record data and position 1 shown in Figure 8b(1), the fixed-point playback module can determine the identifier, in the original record data, of the sentence of recording 1 to which the click position belongs.
  • the identifier of the corresponding sentence in the original record data here is index(3).
  • the identifier index(3) is used to identify the fourth sentence in the text 603, and the fixed-point playback module can obtain the L and T of the fourth sentence from the original record data.
  • the corresponding L of the fourth sentence is l3, and the corresponding T is t3, where t3 is the end time of the audio segment corresponding to the fourth sentence (such as the end playback time).
  • t2 shown in Figure 6b is the starting time of the audio segment corresponding to the fourth sentence (such as the start playback time), where t2 is the 0 minutes and 20 seconds shown in the current playback time control 6021.
  • the fixed-point playback module can move the playback progress control 6024 along the playback progress bar 6023, and the current playback time corresponding to the moved position of the playback progress control 6024 is 0 minutes and 20 seconds.
  • the time shown by the current playback time control 6021 is updated from the 0 minutes and 5 seconds shown in Figure 8b(1) to the starting time t2 of the fourth sentence in text 603, here 0 minutes and 20 seconds.
  • based on the number of characters L of each sentence from the first sentence to the fourth sentence recorded in the original recording data, the fixed-point playback module can also determine the target text fragment corresponding to the fourth sentence in the text 603, and display the target text fragment in the display interface 601 in bold italics to distinguish it from the display manner of the other, unselected text in the text 603.
  • the target text fragment here (for example, the fourth sentence) is "Feel the picturesque spring when white flowers are in full bloom.", to remind the user of the text content to be played at the determined point.
  • when responding to the user's fixed-point playback operation, the fixed-point playback module can not only update the playback progress and identify the selected target text segment, but also set the playback state of the recording to the paused state; as shown in Figure 8b(2), the playback pause control 6025 shows a triangular icon, which is used to indicate that the recording is in the paused playback state.
  • the double vertical line icon shown in the playback pause control 6025 in Figure 8b(1) is used to indicate that the recording is in the playback state. In this way, the user can flexibly choose the timing to play the target text segment as needed.
  • the user can select half a sentence, one sentence, or multiple consecutive sentences for the text 603 shown in Figure 8c or Figure 8d.
  • the fixed-point playback module can implement this through the above-mentioned Embodiment 2.
  • based on the original record data and position 2 shown in Figure 8c(1), the fixed-point playback module can determine the identifier, in the original record data, of the starting sentence corresponding to the selected text.
  • the identifier in the original record data here is index(3).
  • based on the original recording data and position 3 shown in Figure 8c(1), the fixed-point playback module can determine the identifier, in the original recording data, of the terminating sentence corresponding to the selected text.
  • the identifier here is also index(3), indicating that the user has selected one sentence.
  • the identifier index(3) is used to identify the fourth sentence in the text 603, and the fixed-point playback module can obtain the L and T of the fourth sentence from the original recorded data.
  • the L of the fourth sentence is l3 and its T is t3, where t3 is the end time of the audio clip corresponding to the fourth sentence (for example, the end playback time).
  • t2 shown in Figure 6b is the starting time of the audio segment corresponding to the fourth sentence (for example, the start playback time); t2 is the 0 minutes and 20 seconds shown by the current playback time control 6021.
  • the fixed-point playback module can move the playback progress control 6024 along the playback progress bar 6023, and the current playback time corresponding to the moved position of the playback progress control 6024 is 0 minutes and 20 seconds.
  • the time shown by the current playback time control 6021 is updated from the 0 minutes and 5 seconds shown in Figure 8c(1) to the starting time t2 of the fourth sentence in text 603, here 0 minutes and 20 seconds.
  • based on the number of characters L of each sentence from the first sentence to the fourth sentence recorded in the original recording data, the fixed-point playback module can also determine the target text fragment corresponding to the fourth sentence in the text 603, and display it in the display interface 601 in bold italics to distinguish it from the display mode of the other, unselected text in the text 603.
  • the target text fragment here (for example, the fourth sentence) is "Feel the picturesque spring when white flowers are in full bloom.", to remind the user of the text content to be played at the determined point.
  • when the user selects multiple consecutive sentences in the text 603, the fixed-point playback module can determine the identifier index(1) of the starting sentence among the multiple sentences based on the original record data and position 4 shown in Figure 8d(1), and can determine the identifier index(4) of the terminating sentence among the multiple sentences based on the original record data and position 5 shown in Figure 8d(1).
  • the identifier index(1) is used to identify the second sentence in the text 603.
  • the fixed-point playback module can obtain the L and T of the second sentence from the original recorded data.
  • the corresponding L is l1 and the corresponding T is t1, where t1 is the end time of the audio segment corresponding to the second sentence (such as the end playback time).
  • t0 shown in Figure 6b is the starting time of the audio clip corresponding to the second sentence (such as the start playback time), where t0 is 0 minutes and 2 seconds.
  • the identifier index (4) is used to identify the fifth sentence in the text 603, the fixed-point playback module can obtain the L and T of the fifth sentence from the original record data, and the fifth sentence
  • the corresponding L of the sentence is l4, and the corresponding T is t4, where t4 is the end time of the audio segment corresponding to the fifth sentence (for example, the end playback time).
  • the fixed-point playback module can move the playback progress control 6024 along the playback progress bar 6023, and the playback progress control 6024 moves
  • the current playback time corresponding to the last position is 0 minutes and 2 seconds.
  • the time shown in the current playback time control 6021 is updated from 0 minutes and 5 seconds shown in Figure 8d(1) to the second sentence in the text 603 (that is, selected The starting playback time t0 of the starting sentence in multiple sentences), here is 0 minutes and 2 seconds.
  • the fixed-point playback module can also determine, based on the character number L of each sentence from the first sentence to the fifth sentence recorded in the original record data, the target text fragments in the text 603, namely the selected second to fifth sentences, including the third sentence and the fourth sentence in between.
  • the audio control module receives a playback operation.
  • the user can click the play pause control 6025 shown in any of the above three figures, and the audio control module can then receive the playback operation.
  • the triggering method of the playback operation is not limited to clicking the play pause control 6025 as illustrated here.
  • the recorder application can also automatically play the located target audio clip after S307, without the user triggering a playback operation.
  • the fixed-point playback module identifies the corresponding target text segment in the text based on the target audio segment played in real time.
  • the user clicks a certain position in the text 603.
  • when the user clicks the play pause control 6025, the recorder application can respond to the user operation and start playing the recording from the playback time of 0 minutes and 20 seconds (here, the start play time of the sentence located by the clicked position).
  • the start play time of the fourth sentence "Feel the picturesque spring with blooming white flowers." as the target audio clip is t2, whose value is the 0 minutes and 20 seconds shown by the current play time control 6021 in Figure 8b(3); the end play time of the fourth sentence is t3, whose value is the 0 minutes and 25 seconds shown by the current play time control 6021 in Figure 8b(4).
  • t3, 0 minutes and 25 seconds, is also the start play time of the fifth sentence.
  • the fifth sentence is "The spring scenery in the park is intoxicating.”
  • the fixed-point playback module may display the display interface 601 shown in Figure 8b(3) in response to the user clicking the play pause control 6025.
  • the recording starts playing from the start play time of the fourth sentence (here, 0 minutes and 20 seconds), and the fourth sentence in the text 603 is displayed in bold italics.
  • after the fourth sentence finishes playing, the fixed-point playback module can restore the display mode of the fourth sentence to the original display mode (for the display effect, refer to the display mode of the fourth sentence in Figure 8b(4)).
  • the fixed-point playback module can set the display mode of the text corresponding to the next target audio clip to be played (here, the fifth sentence) to the preset display mode (such as bold italics).
  • after the fifth sentence finishes playing, its display mode is restored to the original display mode and the display mode of the sixth sentence is set to the preset display mode, and so on, until playback of the recording corresponding to the text 603 ends, that is, until the current playback time control 6021 shows a current playback time of 1 minute.
  • in this way, the user can click any position in the converted text, and the mobile phone can, based on the position of the click operation, play the target audio clip to which the clicked position belongs and display the target text fragment at the clicked position in the preset display mode. Moreover, after the target audio clip finishes playing, the display mode of the target text fragment is restored to the original display mode, the mobile phone automatically plays the audio clip corresponding to the next sentence as the new target audio clip, and the next sentence, as the new target text fragment, is displayed in the preset display mode. The user therefore only needs to click any position in the converted text, and the recorder application automatically locates each sentence after the one selected by the user, realizing automatic fixed-point playback of each subsequent sentence; a sketch of this location logic is given below. This allows the user to check each subsequent sentence at any time for text conversion errors, which helps the user correct text errors and improves text editing efficiency.
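  • as an illustration of the location step above, the following is a minimal sketch, assuming the original record data is kept as a per-sentence list of (L, T) pairs as in Figure 6b, where L is the sentence's character count and T is the end time of its audio segment in seconds; all names and values are illustrative, not the patent's code.

```python
# Minimal sketch of forward fixed-point location, assuming original record
# data as a per-sentence list of (L, T) pairs: L = character count,
# T = end time (seconds) of the matching audio segment (Figure 6b layout).
RECORDS = [(3, 2.0), (12, 5.0), (9, 20.0), (11, 25.0), (10, 30.0)]

def locate_by_click(char_offset, records=RECORDS):
    """Map a click position (character offset into the full converted text)
    to the sentence index, its text span, and its audio window. The start
    time of sentence i is the end time T of sentence i-1 (0.0 for i = 0)."""
    start_char, start_time = 0, 0.0
    for index, (length, end_time) in enumerate(records):
        if char_offset < start_char + length:
            return {
                "index": index,
                "text_span": (start_char, start_char + length),
                "audio_window": (start_time, end_time),
            }
        start_char += length
        start_time = end_time
    return None  # the click landed past the last converted sentence

# A click inside the fourth sentence (offsets 24..34 here) yields index 3
# and audio_window (20.0, 25.0), i.e. t2 to t3: play from the window start,
# then call locate_by_click for the next sentence when the window ends.
```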
  • the user can select half a sentence, one sentence, or multiple consecutive sentences in the text 603 shown in Figure 8c or Figure 8d.
  • the fixed-point playback module can also perform automatic fixed-point playback after playing the located sentence; for the process of playback and automatic fixed-point playback, refer to the relevant descriptions of Figure 8b(3) and Figure 8b(4), which are not repeated here.
  • when the user clicks the play pause control 6025, the recorder application can also respond to the user operation: as shown in Figure 8d(3), the audio clips of the selected second to fifth sentences continue to be played from the position of 0 minutes and 2 seconds shown by the current playback time control 6021, and the second to fifth sentences in the text 603 are displayed in the preset display mode.
  • for the subsequent playback process, refer to the relevant descriptions of the embodiments in Figure 8b(3) and Figure 8b(4), which are not repeated here.
  • the user can perform a forward fixed-point playback operation, and the recorder application can automatically perform fixed-point playback of each sentence after the target text segment corresponding to the forward fixed-point playback operation.
  • the fixed-point playback module may not perform S310, but only perform fixed-point playback of the target text segment and the target audio segment corresponding to the text selected by the user in text 603.
  • the audio clip of the fourth sentence in the text 603 (here, "Feel the picturesque spring with white flowers in full bloom.") can be played; after the audio clip of the fourth sentence finishes playing, the recording 1 is automatically paused so that the play pause control 6025 displays a triangular icon to indicate that the recording 1 is in the paused playback state.
  • the recorder application can also update the current playback time control 6021 to show the end play time of the fourth sentence, that is, t3.
  • likewise, after the selected sentences finish playing, the recorder application can automatically pause playback of the recording 1, so that the play pause control 6025 displays a triangular icon to indicate that the recording 1 is in the paused playback state.
  • the current playback time control 6021 then shows the end play time t4 of the fifth sentence, which here is 0 minutes and 30 seconds.
  • in this way, when the user clicks on the converted text to achieve forward fixed-point playback (for example, in the scenario of Figure 8a), the mobile phone can automatically play the target audio clip located at the clicked position; after the target audio clip finishes playing, the next audio clip is played automatically and the text clip corresponding to the next audio clip is displayed in the preset display mode, and this cycle continues until the entire recording has been played.
  • when the user needs automatic fixed-point playback, the user can simply click anywhere in the converted text, which facilitates user operations in the automatic fixed-point playback scenario.
  • when the user selects at least two characters of the converted text (such as half a sentence, a sentence, or multiple consecutive sentences), the mobile phone performs forward fixed-point playback.
  • since the target audio clip belongs to the text selected by the user, it is preferable to play the target audio clip to which the selected text belongs and to display the target text clip to which the selected text belongs in the preset display mode; after the target audio clip finishes playing, the mobile phone no longer continues fixed-point playback.
  • the mobile phone can automatically pause playback of the recording, or the mobile phone can play the target audio segment in a loop, and during loop playback the target text segment is always displayed in the preset display mode.
  • when the user selects at least two characters in the converted text, it means that the user may currently be interested only in the target text fragment corresponding to the at least two characters and may need to modify it; the mobile phone can therefore play or highlight only the target audio segment and the target text segment to which the selected characters belong, making it convenient for the user to listen to the played target audio segment and correct the characters in the target text fragment.
  • the audio control module receives a pause playback operation.
  • the user can click the play pause control 6025 to pause playback of the recording 1.
  • S311 may be performed after S310; for example, when the user re-listens to the target audio clip played forward at a fixed point and determines that the target text clip corresponding to the target audio clip has places that need modification, the user triggers the above pause playback operation to pause playback of the recording 1.
  • the text view module edits text.
  • the user can add to, delete, and modify the text corresponding to the played audio according to the played audio content, so that the sentences read smoothly.
  • the user clicks on position 6 in the text 603 in the display interface 601.
  • position 6 is the position between the character "Friends" and the character that follows it in the second sentence of the text 603.
  • the recorder application can obtain the coordinates (x6, y6) of position 6 in the text 603, and can then determine, based on the solution of Embodiment 1, the identifier index(i) of the sentence in the original record data to which position 6 corresponds; here, the identifier determined by the recorder application is index(1), that is, the located sentence is the second sentence.
  • the user adds a comma "," at position 6, so that the number of characters in the second sentence in text 603 is increased by one.
  • the text view module refreshes the original record data in reverse.
  • because the user's editing operation on the text updates the number of characters of the second sentence in the text 603, the text view module can increment by one the value of L of the second sentence identified by index(1) in the original record data of the recording 1 (or the text 603), so as to refresh the original record data in reverse.
  • more generally, the number of characters of the edited sentence identified as index(i) is compared with the number of characters recorded for the sentence identified as index(i) in the original record data, and the character number li of the corresponding sentence in the original record data is updated according to the number of characters of the edited sentence identified as index(i).
  • Example 1: the user edits text between two sentences in the text 603; for example, the user clicks the position between the text "Recently," and the text "reporter" in the text 603 shown in Figure 8e(1).
  • the recorder application determines, based on the target position, which sentence in the original record data the user's clicked position belongs to.
  • the fixed-point playback position determined here by the recorder application is the first sentence.
  • when the number of characters L of a target text segment (such as a sentence as defined in this application) in the original record data is updated according to the user's editing operation on the text, and the editing operation falls between two sentences, the number of characters added by the editing operation can be appended to the number of characters L of the previous sentence, which improves the user experience.
  • if the number of characters added by the editing operation were instead appended to the number of characters L of the next sentence, then, because the position between the two sentences (for example, the first sentence and the second sentence) may have no audio data in the recording, a problem arises: when the user has the recorder application play the audio clip of the second sentence forward at a fixed point, the audio clip does not contain the text content added to the second sentence by the above editing operation, and the lack of corresponding audio data causes the beginning of the output audio clip to mismatch the semantics of the beginning of the second sentence, thus affecting the user's editing of the text and degrading the user experience.
  • Example 2: when the user's editing operation on the text adds characters at the very beginning of the text converted from the audio (that is, before the first sentence), the number of added characters can be appended to the number of characters L of the first sentence in the original record data of the text.
  • for example, the user adds the target text "Reported by our reporter," before the text "Recently," in Figure 8e(1); the number of characters of the target text is 7 and the number of characters l0 of the first sentence of the text 603 is 3, so l0 is updated to 10.
  • it should be noted that although the recorder application updates the number of characters L of the corresponding sentence when refreshing the original record data in reverse, the timestamp T corresponding to the sentence does not change; a sketch of this refresh rule follows.
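  • the reverse-refresh rule can be sketched as follows under the same assumed per-sentence (L, T) layout; the function name and example offsets are hypothetical, and only L is ever changed while T is left untouched.

```python
def refresh_records_on_edit(records, edit_offset, delta):
    """Credit delta added (or, if negative, removed) characters at
    edit_offset to the character count L of one sentence: the sentence
    the offset falls inside; the previous sentence when the offset sits
    exactly on a boundary between two sentences (Example 1); and the
    first sentence when the offset is at the very beginning (Example 2).
    The timestamp T of every sentence is left unchanged."""
    start = 0
    for i, (length, end_time) in enumerate(records):
        end = start + length
        if edit_offset <= end and (edit_offset > start or i == 0):
            records[i] = (length + delta, end_time)  # T unchanged
            return records
        start = end
    raise ValueError("edit offset lies outside the converted text")

records = [(3, 2.0), (12, 5.0), (9, 20.0), (11, 25.0), (10, 30.0)]
refresh_records_on_edit(records, 0, 7)   # before sentence 1: l0 3 -> 10
refresh_records_on_edit(records, 12, 1)  # inside sentence 2: l1 12 -> 13
```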
  • when editing text, the user can also select the text that needs to be edited (or click a certain character), instead of clicking a certain position between characters in the text as shown in Figure 8e (no character exists at such a position).
  • the user can select text 1 (here, the text "Blooming scenery is picturesque"), and the recorder application can display the control 604 on the display interface 601 in response to the user's operation of selecting the text (at least one character).
  • the control 604 can be displayed in the vicinity of the sentences selected by the user; this application does not limit the display position of the control 604 in the display interface 601.
  • the control 604 may include a "Copy" option, a "Cut" option, a "Select All" option, a "Translate" option, a "Play" option, and a "Share" option.
  • the voice recorder application can also display the control 604 only when the user selects at least two characters in the converted text (for example, in any scenario of selecting half a sentence, a sentence, or multiple consecutive sentences).
  • the user may click the "Play" option in the control 604 instead of clicking the play pause control 6025.
  • as for the "Play" option in the control 604, it can be used to play only the target audio segment to which the selected at least two characters belong, and not to play audio segments other than the target audio segment.
  • for the "Copy" option, the recorder application can copy the text 1 selected in Figure 8f(1).
  • for the "Cut" option, the recorder application can perform a cut operation on the text 1 in Figure 8f(1); this cut operation changes the number of characters L of the fourth sentence to which the text 1 belongs.
  • the number of characters l3 of the fourth sentence is updated to 8: the text of the edited fourth sentence is "Feel the spring of white flowers.", whose number of characters is 8.
  • for the "Select All" option, the recorder application can perform a select-all operation on the text 1 in Figure 8f(1).
  • for the "Translate" option, the recorder application can perform a translation operation on the text 1 in Figure 8f(1), such as Chinese to English or English to Chinese; the source language and target language of the translation can be preconfigured, which is not limited here.
  • for the "Play" option, the recorder application can determine the fixed-point play position of the text 1 in Figure 8f(1) according to the method of Embodiment 2 above, that is, determine which sentence in the original record data the text 1 belongs to (the fourth sentence in this example), and then display the display interface 601 shown in Figure 8f(2) to perform fixed-point playback of the fourth sentence.
  • the implementation principle of the forward fixed-point playback from Figure 8f(1) to Figure 8f(2) is similar to the solution shown in Figure 8c and is not repeated here.
  • control 604 is the same as that introduced in the embodiment of Figure 8f, and will not be described again here.
  • in summary, the voice recorder application of the embodiments of this application can obtain and store the original record data while converting the recording into text. After the recording-to-text operation is finished, the user clicks a semantically incoherent sentence in the converted text, or selects a few specified sentences, to play the corresponding audio clip forward at a fixed point; for example, the sentence being played can be displayed in blue to clearly prompt the user. The user can perform editing operations such as additions, deletions, and modifications on the text according to the played audio content. When an editing operation changes the number of characters of the corresponding sentence, the recorder application can refresh the original record data to ensure that, in subsequent forward or reverse fixed-point playback scenarios, the audio clips played at fixed points and the text clips displayed in the preset display mode remain accurate.
  • in forward fixed-point playback, the electronic device can, from the user's operation position on the converted text (such as the cursor position), locate the audio fragment and text fragment corresponding to one sentence based on the original record data recording the mapping relationship between T and L, thereby determining the specific position of the audio fragment in the complete recording and the specific position of the text fragment in the complete text, enabling forward fixed-point playback.
  • the electronic device can also, from at least two characters selected by the user in the converted text (half a sentence, a sentence, or multiple consecutive sentences), locate at least one audio fragment and at least one text fragment based on the original record data recording the mapping relationship between T and L, determining the specific position of the at least one audio fragment in the complete recording and the specific position of the at least one text fragment in the complete text, enabling forward fixed-point playback; a sketch of this selection case follows.
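  • for the selection case, a companion sketch under the same assumed per-sentence (L, T) layout; the names are illustrative, and the windowing rule merely restates the behavior described above.

```python
def selection_to_window(records, sel_start, sel_end):
    """Resolve a selected character range [sel_start, sel_end) to the span
    of sentences it touches and to one contiguous audio window: from the
    start time of the first selected sentence to the end time T of the
    last one. records is a per-sentence list of (L, T) pairs."""
    spans, start_char, start_time = [], 0, 0.0
    for length, end_time in records:
        spans.append((start_char, start_char + length, start_time, end_time))
        start_char += length
        start_time = end_time
    first = next(i for i, s in enumerate(spans) if sel_start < s[1])
    last = next(i for i, s in enumerate(spans) if sel_end <= s[1])
    return (first, last), (spans[first][2], spans[last][3])

records = [(3, 2.0), (12, 5.0), (9, 20.0), (11, 25.0), (10, 30.0)]
# Selecting the second through fifth sentences (characters 5..40 here)
# yields sentence span (1, 4) and audio window (2.0, 30.0), i.e. t0 to t4,
# matching the Figure 8d scenario.
selection_to_window(records, 5, 40)
```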
  • Figure 7b is a schematic diagram of the data processing process used by the recorder application in an exemplary reverse fixed-point playback scenario, which can be understood in conjunction with the scene diagrams of Figure 6b, Figure 8a, and Figure 8g.
  • the process may include the following steps:
  • the audio control module obtains the original recording data of the recording.
  • the audio control module obtains the fixed-point playback time in response to the fixed-point playback operation of the recording playback progress bar.
  • the display interface of the mobile phone switches to the display interface 601 shown in Figure 8g(1), where the playback progress control 6024 is located at the position on the playback progress bar 6023 corresponding to progress p1, and progress p1 corresponds to the current playback time shown in the current playback time control 6021, which here is 0 minutes and 5 seconds.
  • the user drags the playback progress control 6024 in the direction of the arrow from progress p1 to the position corresponding to progress p2 shown in Figure 8g(2); progress p2 corresponds to the current playback time shown in the current playback time control 6021 in Figure 8g(2), which here is 0 minutes and 21 seconds. The user's operation of dragging the playback progress control 6024 changes the current playback time of the recording 1, thereby triggering execution of the update callback function.
  • when the recorder application executes the update callback function, it can obtain the current playback time progressTime (an example of the fixed-point playback time described in S302 above) based on the current position of the playback progress control 6024 on the playback progress bar 6023 (for example, progress p2); here progressTime is 0 minutes and 21 seconds.
  • the fixed-point playback module performs fixed-point calculation based on the above-mentioned fixed-point playback time and original recorded data, and determines the fixed-point playback position.
  • the fixed-point playback module updates the playback progress and identifies the target text segment in the text.
  • the identifier index(3) is used to identify the fourth sentence in the text 603; the fixed-point playback module can obtain the L and T of the fourth sentence from the original record data.
  • the corresponding L of the fourth sentence is l3, and the corresponding T is t3, where t3 is the end time of the audio segment corresponding to the fourth sentence (for example, the end play time).
  • t2 as shown in Figure 6b is the starting time of the audio segment corresponding to the fourth sentence (for example, the start play time); t2 is the 0 minutes and 20 seconds shown by the current play time control 6021.
  • the fixed-point playback module can adjust the position of the playback progress control 6024 on the playback progress bar 6023, switching it from the position corresponding to progress p2 to the position corresponding to progress p3, where progress p3 corresponds to the current playback time shown in the current playback time control 6021 in Figure 8g(3), which here is 0 minutes and 20 seconds.
  • the fixed-point playback module also switches the current play time shown in the current playback time control 6021 from 0 minutes and 21 seconds to 0 minutes and 20 seconds, thereby adjusting the current play time to the start play time t2 of the located fourth sentence.
  • the fixed-point playback module can also determine the target text fragment corresponding to the fourth sentence in the text 603 based on the number of characters L of each sentence from the first sentence to the fourth sentence recorded in the original record data, and display the target text fragment in the display interface 601 in bold italics to distinguish it from the display mode of the other, unselected text in the text 603.
  • the target text fragment here (for example, the fourth sentence) is "Feel the picturesque spring when white flowers are in full bloom." to remind the user of the text content to be played at the located point; a sketch of this fixed-point calculation follows.
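  • a minimal sketch of the fixed-point calculation of S303, under the same assumed per-sentence (L, T) record layout (illustrative names): find the sentence whose audio window contains the dragged-to progressTime, snap playback to the window start, and return the text span to highlight.

```python
def fixed_point_from_progress(records, progress_time):
    """Reverse fixed-point playback: sentence i covers the window
    (T[i-1], T[i]], with 0.0 before the first sentence; records is a
    per-sentence list of (L, T) pairs."""
    start_char, start_time = 0, 0.0
    for index, (length, end_time) in enumerate(records):
        if progress_time <= end_time:
            return {
                "index": index,
                "snap_to": start_time,  # new current play time
                "text_span": (start_char, start_char + length),
            }
        start_char += length
        start_time = end_time
    return None  # progress beyond the end of the recording

records = [(3, 2.0), (12, 5.0), (9, 20.0), (11, 25.0), (10, 30.0)]
# Dragging to 0 min 21 s returns index 3 (the fourth sentence) with
# snap_to 20.0, matching the snap from p2 to p3 in Figure 8g(3).
fixed_point_from_progress(records, 21.0)
```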
  • when responding to the user's fixed-point playback operation, the fixed-point playback module can not only update the playback progress and identify the located target text fragment, but also set the playback state of the recording to the paused playback state: the play pause control 6025 in Figure 8g(3) shows a triangular icon indicating that the recording is paused. In this way, the user can flexibly choose when to play the target text fragment as needed.
  • the audio control module receives a playback operation.
  • the fixed-point playback module identifies the corresponding target text segment in the text based on the target audio segment played in real time.
  • the audio control module receives a pause playback operation.
  • the text view module edits text.
  • the text view module refreshes the original record data in reverse.
  • the voice recorder application of the embodiment of the present application can obtain and store the original recording data while converting the recording into text.
  • the user can drag the playback progress bar to achieve reverse fixed-point playback.
  • the sentence being played can be displayed in blue to explicitly prompt the user.
  • Users can perform editing operations such as additions, deletions, and modifications to the text based on the played audio content.
  • when an editing operation changes the number of characters of the corresponding sentence, the recorder application can refresh the original record data to ensure that, in subsequent forward or reverse fixed-point playback scenarios, the audio clips played at fixed points and the text clips displayed in the preset display mode remain accurate.
  • in the embodiments of this application, the recording can be converted into text locally on the electronic device, and the text can be converted in real time during the recording process, improving the accuracy of the timestamps of each sentence in the original record data.
  • when the user finds semantically incoherent text in the converted text of the recording, the user can click the text that needs to be located, or drag the playback progress bar, to achieve fixed-point playback of the text and the recording.
  • the fixed-point playback accuracy is high, which can improve the proofreading efficiency and editing efficiency for the text of the recording file.
  • the technical solution implemented by the above-mentioned electronic device of the present application can also be applied to a system.
  • the system can include a first electronic device and a second electronic device that are communicatively connected, wherein the first electronic device has an audio recording function.
  • the second electronic device has a recording to text function, a text editing function and a fixed-point playback function.
  • the system can be applied to a distributed microphone scenario, where the distributed microphone represents the microphones of at least two electronic devices that are communicatively connected.
  • the first electronic device is a recording pen
  • the second electronic device is a mobile phone, tablet computer, or notebook computer.
  • the recording pen can record audio in real time and send the real-time recorded audio to the tablet computer, which can perform text conversion on the audio received in real time to obtain the original record data.
  • the user can operate the tablet to play the recording to display the text content corresponding to the recording.
  • users can operate the recording or text to achieve forward fixed-point playback and reverse fixed-point playback, as well as editing operations such as proofreading of text.
  • the principle of the specific implementation of the technical solution of the present application is similar to the implementation of the solution applied to one electronic device, and will not be described again here.
  • the system implements the technical solution of the present application.
  • one user can use the first electronic device to record audio on site, and another user can use the second electronic device to synchronously receive the real-time recorded audio and the converted text, which is beneficial to improving text editing efficiency and text proofreading efficiency.
  • the electronic device includes corresponding hardware and/or software modules that perform each function.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or computer software driving the hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions in conjunction with the embodiments for each specific application, but such implementations should not be considered to be beyond the scope of this application.
  • FIG. 9 shows a schematic block diagram of a device 300 according to an embodiment of the present application.
  • the device 300 may include: a processor 301 and a transceiver/transceiver pin 302, and optionally, a memory 303.
  • the components of the device 300 may be coupled via a bus 304; in addition to a data bus, the bus 304 includes a power bus, a control bus, and a status signal bus.
  • for the sake of clarity, however, the various buses are all referred to as bus 304 in the figure.
  • the memory 303 may be used to store the instructions in the foregoing method embodiments.
  • the processor 301 can be used to execute instructions in the memory 303, and control the receiving pin to receive signals, and control the transmitting pin to send signals.
  • the device 300 may be the electronic device or a chip of the electronic device in the above method embodiment.
  • This embodiment also provides a computer storage medium that stores computer instructions.
  • when the computer instructions are run on an electronic device, the electronic device is caused to execute the above related method steps to implement the data processing method in the above embodiments.
  • This embodiment also provides a computer program product.
  • when the computer program product is run on a computer, the computer is caused to perform the above related steps to implement the data processing method in the above embodiments.
  • embodiments of the present application also provide a device, which may be a chip, component or module.
  • the device may include a processor and a memory that are connected, where the memory is used to store computer-executable instructions.
  • when the device runs, the processor may execute the computer-executable instructions stored in the memory, so that the chip performs the data processing method in each of the above method embodiments.
  • the electronic device, computer storage medium, computer program product, and chip provided in the embodiments are all used to execute the corresponding methods provided above; therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding methods provided above, which are not repeated here.
  • the disclosed devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of modules or units is only a logical function division; in actual implementation there may be other division methods, for example, multiple units or components may be combined or integrated into another device, or some features may be ignored or not implemented.
  • the coupling, direct coupling, or communication connection between the parts shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be in electrical, mechanical, or other forms.
  • a unit described as a separate component may or may not be physically separate.
  • a component shown as a unit may be one physical unit or multiple physical units, that is, it may be located in one place or distributed across multiple different places; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiments.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.
  • Integrated units may be stored in a readable storage medium if they are implemented in the form of software functional units and sold or used as independent products.
  • based on this understanding, the technical solutions of the embodiments of this application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product; the software product is stored in a storage medium and includes several instructions for causing a device (which may be a microcontroller, a chip, etc.) or a processor to execute all or some of the steps of the methods of the embodiments of this application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.
  • the steps of the methods or algorithms described in connection with the disclosure of the embodiments of this application can be implemented in hardware or by a processor executing software instructions.
  • Software instructions can be composed of corresponding software modules.
  • software modules can be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium well known in the art.
  • An exemplary storage medium is coupled to the processor such that the processor can read information from the storage medium and write information to the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and storage media may be located in an ASIC. Additionally, the ASIC can be located in a network device. Of course, the processor and storage media can also exist as discrete components in the network device.
  • computer-readable media include computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another.
  • Storage media can be any available media that can be accessed by a general purpose or special purpose computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Telephone Function (AREA)
  • Management Or Editing Of Information On Record Carriers (AREA)

Abstract

A data processing method and an electronic device, relating to the technical field of terminal devices. The method includes: during recording, converting the audio recorded in real time into text, which can improve text conversion efficiency; and, during the recording-to-text process, acquiring original record data concerning timestamps and character counts for fixed-point playback of the audio, which can improve the accuracy of fixed-point playback of audio and text.

Description

Data processing method and electronic device
This application claims priority to the Chinese patent application No. 202210335616.9, entitled "数据处理方法及电子设备" ("Data processing method and electronic device"), filed with the China National Intellectual Property Administration on March 31, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of this application relate to the technical field of terminal devices, and in particular to a data processing method and an electronic device.
Background
In some scenarios, such as lectures, meetings, and interviews, users often need to use an electronic device to record live speech and to convert the recording into text after the recording ends. After the recording ends, the user may need to modify the converted text, and can replay the recording on the electronic device to guide the modification.
At present, an electronic device can only convert a recording into text after the recording ends, so the recording-to-text efficiency is low. Moreover, when the text is corrected by replaying the recording, the electronic device tends to suffer from inaccurate positioning.
Summary
To solve the above technical problems, this application provides a data processing method and an electronic device. In the method, text conversion can be performed during recording, which improves text conversion efficiency, and original record data concerning timestamps and character counts is acquired during the recording-to-text process for fixed-point playback of audio, which improves the accuracy of fixed-point playback of audio and text.
In a first aspect, an embodiment of this application provides a data processing method applicable to an electronic device. The method includes: in response to a received first user operation, acquiring first information in the process of converting audio data into text data, where the audio data is audio data collected in real time, and the first information includes a first mapping relationship between a first timestamp of a first audio segment and a first character count of a first text segment, the first text segment being the first text conversion result of the first audio segment; the audio data includes at least one first audio segment, the text data includes at least one first text segment, and the first timestamp is a timestamp identifying the start time point or end time point of the first audio segment; and, in response to a received second user operation, updating, based on the first information, the playback progress of the audio data to a first start time point of a second audio segment, and displaying a second text segment in a preset display mode, where the second text segment is the second text conversion result of the second audio segment, the second audio segment includes at least one first audio segment, and the second text segment includes at least one first text segment.
Exemplarily, the text segments involved in the first mapping relationship in the first information are the final text conversion results (final results for short) of the corresponding audio segments; for example, the first text conversion result refers to the final text conversion result of the first audio segment, and likewise the second text conversion result refers to the final text conversion result of the second audio segment.
Exemplarily, the second user operation may be an operation of adjusting the playback progress of the recorded audio data to achieve reverse fixed-point playback, or an operation on the text data converted from the audio data to achieve forward fixed-point playback.
Exemplarily, the second user operation may also take the form of a voice input that triggers the above forward or reverse fixed-point playback, which is not limited in this application.
In this way, this embodiment can convert audio collected in real time into text during audio collection, and acquire the first information (also called original record data) during the audio-to-text process, where the original record data can include the mapping relationship between the timestamp of an audio segment and the character count of a text segment, the text segment being the final result of the electronic device's text conversion of that audio segment. Since the timestamps and character counts in the original record data are relatively accurate, when the recorded audio is played at a fixed point using the original record data, the second audio segment to be played can be located relatively accurately, and the final text conversion result of the second audio segment can be located relatively accurately in the converted text data, so that audio and text can be located accurately. Moreover, recording-to-text can be completed on the electronic device side, which improves recording-to-text efficiency.
According to the first aspect, the last character in the first text segment is a preset punctuation mark, where the preset punctuation mark is a punctuation mark that semantically indicates sentence segmentation.
Exemplarily, the preset punctuation marks may include but are not limited to: the comma, full stop, exclamation mark, question mark, semicolon, enumeration comma, etc.
In this embodiment, the final text data obtained by converting the audio data may include at least one first text segment, and the last character of each first text segment is the preset punctuation mark. This implements text conversion of the audio and makes it convenient to use the preset punctuation marks to determine the first timestamps of the corresponding at least one first audio segment, improving the accuracy of the first timestamps in the original record data.
According to the first aspect or any implementation of the first aspect above, acquiring the first information in the process of converting the audio data into the text data includes: in the process of converting the audio data into the text data, upon detecting that the type of a third text conversion result of the audio data is an intermediate result, acquiring a second timestamp based on the preset punctuation marks in the third text conversion result, and, based on the second timestamp, recording or updating a first correspondence between arrangement order and timestamp for the intermediate result, where the second timestamp identifies the generation time of the third text conversion result and the arrangement order represents the arrangement order of the preset punctuation marks in the intermediate result; and, in the process of converting the audio data into the text data, upon detecting that the type of a fourth text conversion result of the audio data is a final result, acquiring the first mapping relationship based on a second correspondence and the fourth text conversion result, where the second correspondence is the first correspondence for the most recently detected intermediate result.
Exemplarily, during the audio-to-text process, an audio segment to be converted can be converted into a temporary result (that is, an intermediate result), and the temporary result can be iteratively updated; when the semantics of a temporary result are complete, the most recently converted temporary result is taken as the final result (that is, the final text conversion result).
Exemplarily, when the type of the fourth text conversion result of the audio data is a final result, the fourth text conversion result is the same as the most recently detected third text conversion result; that is, the text content of the final result is the same as that of the most recently detected temporary result.
Exemplarily, the precision of the audio-to-text algorithm used by the electronic device may be a single character; that is, every time a character (a single word, term, or symbol, etc.) is added to or removed from the temporary result, an updated temporary result is output. Exemplarily, the generation time and output time of a temporary result can be the same, so the second timestamp (for example, the current audio collection duration, exemplarily the current recording duration) can be acquired based on the preset punctuation marks in the temporary result. When the type of the text conversion result of the audio data is a final result, the electronic device can generate at least one first mapping relationship in the original record data based on the correspondence between arrangement order and timestamp for the most recently detected temporary result. At the end of recording, original record data can be generated for the recorded audio data and the text data converted from it (including at least one final result). This embodiment can improve the accuracy of each first mapping relationship in the original record data.
According to the first aspect or any implementation of the first aspect above, acquiring the second timestamp based on the preset punctuation marks in the third text conversion result includes: acquiring the second timestamp upon detecting that the third text conversion result includes a preset punctuation mark and is the first intermediate result; or acquiring the second timestamp upon detecting that a first number of preset punctuation marks in the third text conversion result is greater than a second number of preset punctuation marks in the previous third text conversion result; or acquiring the second timestamp upon detecting that the first number is smaller than the second number.
Exemplarily, in the process of converting a real-time recording into text, a group of at least one temporary result (that is, intermediate result) and the final result corresponding to that at least one temporary result can be generated in sequence. After the current final result is generated, a next group of at least one temporary result and the next corresponding final result can be generated, and the final results obtained in sequence constitute the final text conversion results of the audio data. For each group, when the first temporary result is generated and includes a preset punctuation mark, the current recording duration can be acquired to obtain the second timestamp.
Within the same group of temporary results, if the number of preset punctuation marks differs between the current temporary result and the previous temporary result, acquiring the current recording duration is likewise triggered to obtain the second timestamp.
After acquiring the second timestamp, the electronic device can use it to continue updating the latest first correspondence of the temporary result, so as to improve the accuracy of the first timestamps in the original record data.
According to the first aspect or any implementation of the first aspect above, recording or updating, based on the second timestamp, the first correspondence between arrangement order and timestamp for the intermediate result includes: in the first correspondence for the intermediate result, recording or adding a correspondence between the last arrangement order and the second timestamp; or, in the first correspondence for the intermediate result, deleting the correspondence between the last arrangement order and its timestamp to update the first correspondence, and updating the timestamp corresponding to the current last arrangement order in the updated first correspondence to the second timestamp.
Exemplarily, each time the first correspondence of a temporary result is recorded or updated, one record can be added.
For example, if the first correspondence includes the correspondence of arrangement order 0 with timestamp 0 and of arrangement order 1 with timestamp 1, then adding a correspondence between the last arrangement order and the second timestamp (for example, timestamp 2) means adding the correspondence of arrangement order 2 with timestamp 2 to the first correspondence; arrangement order 2 here is an example of the above last arrangement order.
For example, if the first correspondence includes the correspondence of arrangement order 0 with timestamp 0 and of arrangement order 1 with timestamp 1, the correspondence between the last arrangement order and its timestamp can be deleted, that is, the correspondence of arrangement order 1 with timestamp 1 is deleted, so that the updated first correspondence includes the correspondence of arrangement order 0 with timestamp 0; then the timestamp (here timestamp 0) corresponding to the current last arrangement order (here arrangement order 0) in the updated first correspondence can be updated to the second timestamp (here timestamp 2), so that arrangement order 0 in the updated first correspondence corresponds to timestamp 2.
According to the first aspect or any implementation of the first aspect above, acquiring the first mapping relationship based on the second correspondence and the fourth text conversion result includes: determining, based on the arrangement orders in the second correspondence, the first character count of each of the at least one first text segment in the fourth text conversion result; determining, based on the mutually corresponding arrangement orders and timestamps in the second correspondence, the first timestamp of each of the at least one first audio segment in the audio data corresponding to the fourth text conversion result; and acquiring, based on the arrangement orders in the second correspondence, the first mapping relationship between the first timestamp and the first character count, where a first timestamp and a first character count with the same arrangement order map to each other.
For example, the second correspondence includes: arrangement order 0 corresponds to timestamp 2, and arrangement order 1 corresponds to timestamp 3. The type of the fourth text conversion result here is a final result, so based on the second correspondence it can be known that the fourth text conversion result includes two preset punctuation marks; for example, the fourth text conversion result is "你好！我叫张三。" ("Hello! My name is Zhang San."). The fourth text conversion result can then be segmented at the preset punctuation marks to determine the first text segments and their first character counts: the first character count l0 of arrangement order 0 is 3 (the character count of "你好！"), and the first character count l1 of arrangement order 1 is 5 (the character count of "我叫张三。"). Similarly, it can be determined that, in the audio data before conversion, the timestamp of the first audio segment with arrangement order 0 is timestamp 2 and that of the first audio segment with arrangement order 1 is timestamp 3. Based on the arrangement orders in the second correspondence, the first mapping relationship between the first character count l0 of the 0th first text segment and timestamp 2 of the 0th first audio segment, and the first mapping relationship between the first character count l1 of the 1st first text segment and timestamp 3 of the 1st first audio segment, can be obtained.
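To make the computation above concrete, the following is a minimal sketch; the delimiter set, function names, and timestamp labels are assumptions for illustration, not definitions from this application.

```python
import re

DELIMITERS = "，。！？；、"  # assumed set of preset punctuation marks

def segment_counts(final_text):
    """Split a final result at sentence-delimiting punctuation and count
    the characters of each segment, delimiter included."""
    segments = re.findall(f"[^{DELIMITERS}]*[{DELIMITERS}]", final_text)
    return [len(segment) for segment in segments]

counts = segment_counts("你好！我叫张三。")
assert counts == [3, 5]  # l0 = 3 for "你好！", l1 = 5 for "我叫张三。"

# Pair by arrangement order with the timestamps of the second
# correspondence (order 0 -> timestamp 2, order 1 -> timestamp 3):
mapping = list(zip(counts, ["timestamp 2", "timestamp 3"]))
# -> [(3, 'timestamp 2'), (5, 'timestamp 3')]
```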
According to the first aspect or any implementation of the first aspect above, in response to the received second user operation, updating the playback progress of the audio data to the first start time point of the second audio segment based on the first information and displaying the second text segment in the preset display mode includes: in response to the received second user operation, determining at least one first mapping relationship in the first information; determining at least one third audio segment based on the at least one first mapping relationship and the audio data, where the third audio segment with the earliest first timestamp among the at least one third audio segment is the second audio segment; determining at least one third text segment based on the at least one first mapping relationship and the text data, where the second text segment includes the at least one third text segment; updating, based on the first information, the playback progress of the audio data to the first start time point of the second audio segment; and displaying the second text segment in the text data in the preset display mode.
Exemplarily, the at least one third text segment is the final text conversion result of the at least one third audio segment.
Exemplarily, when there are multiple third audio segments, the third audio segments are consecutive audio segments in the audio data.
According to the first aspect or any implementation of the first aspect above, the second user operation includes a first operation on the text data, the first operation includes at least one click position, and determining the at least one first mapping relationship in the first information in response to the received second user operation includes: in response to the received first operation on the text data, determining, based on the first information and the at least one click position, at least one second character count of the characters in the text data respectively preceding the at least one click position; and determining the at least one first mapping relationship in the first information based on the first information and the at least one second character count.
Exemplarily, this implementation may be a forward fixed-point playback scenario, where the first mapping relationship corresponding to each click position in the original record data is determined through the user's at least one click position on the text data; since there is at least one click position, at least one first mapping relationship is determined.
According to the first aspect or any implementation of the first aspect above, the first operation includes only one click position, and the number of the at least one first mapping relationship is one.
Exemplarily, the click position may be a click on a position between two characters in the text data, or a click on one character, which is not limited in this application; here one second audio segment to be played and one second text segment are determined, so the number of first mapping relationships is one.
According to the first aspect or any implementation of the first aspect above, the first operation includes two click positions and is used to select at least two characters; when there are multiple third audio segments, the multiple third audio segments are audio segments with consecutive playback times in the audio data.
Exemplarily, the first operation may be an operation of selecting at least two characters of the text data, such as a text selection operation, in which the start position and end position are the two click positions. For these two click positions, their respective first mapping relationships are determined in the original record data. If the two determined first mapping relationships are the same, it may be the case that the user selected at least two characters within one sentence, where one sentence can be understood as one first text segment. If the two determined first mapping relationships are different, there are multiple third audio segments, and the at least one third audio segment may include the starting audio segment, the terminating audio segment, and the consecutive audio segments in between.
According to the first aspect or any implementation of the first aspect above, after updating the playback progress of the audio data to the first start time point of the second audio segment and displaying the second text segment in the preset display mode, the method further includes: in response to a received third user operation, playing the at least one third audio segment in sequence from the first start time point of the second audio segment, in the order of their first timestamps from earliest to latest. This embodiment can play the at least one third audio segment corresponding to the user's forward or reverse fixed-point playback.
According to the first aspect or any implementation of the first aspect above, the first operation includes only one click position, there is one third audio segment and one third text segment, and the second audio segment is the same as the third audio segment; after playing the at least one third audio segment in sequence, the method further includes: when playback reaches the first end time point of the third audio segment, continuing, based on the first information, to play a fourth audio segment whose second start time point is the first end time point; and, when playback reaches the first end time point of the third audio segment, restoring the display mode of the third text segment to the original display mode and, based on the first information, updating the display mode of a fourth text segment corresponding to the fourth audio segment from the original display mode to the preset display mode.
Exemplarily, when the user performs a click operation on the text data with only one click position, the click position may be one character or a position between two characters; the playback progress of the audio data can be adjusted to the start time point of the third audio segment corresponding to the click position, and the third text segment corresponding to the third audio segment can be displayed in the preset display mode. When the audio data plays to the end time point of the third audio segment, the next audio segment can continue to be played, the third text segment that has finished playing is restored to the original display mode, and the text segment corresponding to the next audio segment to be played is displayed in the preset display mode. Thus one forward fixed-point playback operation by the user achieves automatic fixed-point playback of the subsequent audio segments.
According to the first aspect or any implementation of the first aspect above, the first operation is used to select at least two characters; after playing the at least one third audio segment in sequence, the method further includes: when playback reaches the second end time point of the third audio segment with the latest first timestamp among the at least one third audio segment, pausing playback of the at least one third audio segment and restoring the display mode of the at least one third text segment to the original display mode.
Exemplarily, when the user selects half a sentence, one sentence, or multiple sentences, after the at least one third audio segment corresponding to the text selected by the user finishes playing, playback of the subsequent audio segments in the audio data stops, and the display mode of the at least one third text segment corresponding to the at least one third audio segment is restored to the original display mode. This makes it convenient for the user to re-listen to the at least one third audio segment and perform editing and correction on the at least one third text segment, improving text proofreading efficiency.
According to the first aspect or any implementation of the first aspect above, the second user operation includes an adjustment operation on the playback progress of the audio data, the adjustment operation includes a playback progress time of the audio data, and determining the at least one first mapping relationship in the first information in response to the received second user operation includes: in response to the received adjustment operation on the playback progress of the audio data, determining one first mapping relationship in the audio data based on the first information and the playback progress time, where there is one third audio segment and the time range corresponding to the third audio segment includes the playback progress time, the time range being the range formed by the third start time point and the third end time point of the third audio segment.
Exemplarily, this embodiment may be a reverse fixed-point playback scenario: the user can adjust the playback progress bar of the recorded audio data to change the audio playback progress, and the electronic device can, based on the original record data, determine which sentence's audio segment in the audio data the playback progress time adjusted to by the user belongs to, so as to achieve reverse fixed-point playback of the audio.
In a second aspect, an embodiment of this application provides an electronic device. The electronic device includes a memory and a processor that are coupled; the memory stores program instructions which, when executed by the processor, cause the electronic device to perform the method in the first aspect or any implementation of the first aspect.
For the technical effects corresponding to the second aspect, refer to the technical effects corresponding to the first aspect or any implementation of the first aspect, which are not repeated here.
In a third aspect, an embodiment of this application provides a computer-readable medium for storing a computer program which, when run on an electronic device, causes the electronic device to perform the method in the first aspect or any implementation of the first aspect.
For the technical effects corresponding to the third aspect, refer to the technical effects corresponding to the first aspect or any implementation of the first aspect, which are not repeated here.
In a fourth aspect, an embodiment of this application provides a chip including one or more interface circuits and one or more processors; the interface circuits are configured to receive signals from a memory of an electronic device and send the signals to the processors, the signals including computer instructions stored in the memory; when the processors execute the computer instructions, the electronic device is caused to perform the method in the first aspect or any implementation of the first aspect.
For the technical effects corresponding to the fourth aspect, refer to the technical effects corresponding to the first aspect or any implementation of the first aspect, which are not repeated here.
In a fifth aspect, an embodiment of this application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method in the first aspect or any implementation of the first aspect.
For the technical effects corresponding to the fifth aspect, refer to the technical effects corresponding to the first aspect or any implementation of the first aspect, which are not repeated here.
Brief Description of Drawings
To describe the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 is one schematic structural diagram of an exemplary electronic device;
Figure 2 is a schematic diagram of the software structure of an exemplary electronic device;
Figure 3 is a schematic diagram of a recording-to-text interface in the conventional technology;
Figure 4a is a flowchart of an exemplary fixed-point playback method;
Figure 4b is a flowchart of an exemplary fixed-point playback method;
Figure 5 is a schematic diagram of an exemplary application scenario of an electronic device;
Figure 6a is a schematic diagram of an exemplary audio-to-text process;
Figure 6b is a schematic structural diagram of exemplary original record data;
Figure 7a is a schematic diagram of an exemplary data processing process;
Figure 7b is a schematic diagram of an exemplary data processing process;
Figure 8a is a schematic diagram of an exemplary application scenario of an electronic device;
Figure 8b is a schematic diagram of an exemplary application scenario of an electronic device;
Figure 8c is a schematic diagram of an exemplary application scenario of an electronic device;
Figure 8d is a schematic diagram of an exemplary application scenario of an electronic device;
Figure 8e is a schematic diagram of an exemplary application scenario of an electronic device;
Figure 8f is a schematic diagram of an exemplary application scenario of an electronic device;
Figure 8g is a schematic diagram of an exemplary application scenario of an electronic device;
Figure 9 is a schematic structural diagram of a device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are some rather than all of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate the three cases: A alone, both A and B, and B alone.
The terms "first" and "second" in the specification and claims of the embodiments of this application are used to distinguish different objects rather than to describe a specific order of objects. For example, a first target object and a second target object are used to distinguish different target objects rather than to describe a specific order of target objects.
In the embodiments of this application, words such as "exemplary" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" in the embodiments of this application should not be construed as more preferred or advantageous than other embodiments or designs; rather, the use of such words is intended to present related concepts in a concrete manner.
In the description of the embodiments of this application, unless otherwise stated, "multiple" means two or more; for example, multiple processing units means two or more processing units, and multiple systems means two or more systems.
Figure 1 shows a schematic structural diagram of the electronic device 100. It should be understood that the electronic device 100 shown in Figure 1 is only one example of an electronic device. Optionally, the electronic device 100 may be a terminal, also called a terminal device; the terminal may be a cellular phone, a tablet (pad), a wearable device, an Internet of Things device, etc., which is not limited in this application. It should be noted that the electronic device 100 may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration. The various components shown in Figure 1 may be implemented in hardware including one or more signal processing and/or application-specific integrated circuits, in software, or in a combination of hardware and software.
The electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, antenna 1, antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headset jack 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
The processor 110 may include one or more processing units; for example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices or may be integrated in one or more processors.
The controller may be the nerve center and command center of the electronic device 100. The controller can generate operation control signals according to instruction opcodes and timing signals to complete the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache, which can hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from the memory, avoiding repeated access, reducing the waiting time of the processor 110, and thus improving system efficiency.
In some embodiments, the processor 110 may include one or more interfaces, which may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, etc. The USB interface 130 can be used to connect a charger to charge the electronic device 100, to transfer data between the electronic device 100 and peripheral devices, and to connect headphones to play audio through them; it can also be used to connect other electronic devices, such as AR devices.
It can be understood that the interface connection relationships between the modules illustrated in the embodiments of this application are only schematic and do not constitute a structural limitation on the electronic device 100. In other embodiments of this application, the electronic device 100 may also adopt interface connection manners different from those in the above embodiments, or a combination of multiple interface connection manners.
The charging management module 140 is configured to receive charging input from a charger, which may be a wireless or wired charger. In some wired charging embodiments, the charging management module 140 may receive the charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive wireless charging input through a wireless charging coil of the electronic device 100. While charging the battery 142, the charging management module 140 may also supply power to the electronic device through the power management module 141.
The power management module 141 is configured to connect the battery 142, the charging management module 140, and the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140 and supplies power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, the wireless communication module 160, etc. The power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle count, and battery health status (leakage, impedance). In some other embodiments, the power management module 141 may also be provided in the processor 110; in other embodiments, the power management module 141 and the charging management module 140 may be provided in the same device.
The wireless communication function of the electronic device 100 can be implemented through antenna 1, antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, etc.
Antenna 1 and antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the electronic device 100 can be used to cover one or more communication frequency bands, and different antennas can also be reused to improve antenna utilization; for example, antenna 1 can be reused as a diversity antenna of a wireless local area network. In some other embodiments, the antennas can be used in combination with a tuning switch.
The mobile communication module 150 can provide solutions for wireless communication including 2G/3G/4G/5G applied on the electronic device 100, and may include at least one filter, switch, power amplifier, low noise amplifier (LNA), etc. The mobile communication module 150 can receive electromagnetic waves via antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation; it can also amplify signals modulated by the modem processor and convert them into electromagnetic waves radiated via antenna 1. In some embodiments, at least some functional modules of the mobile communication module 150 may be provided in the processor 110, or in the same device as at least some modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator modulates a low-frequency baseband signal to be sent into a medium-high frequency signal; the demodulator demodulates a received electromagnetic wave signal into a low-frequency baseband signal and transmits it to the baseband processor for processing. After being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor, which outputs sound signals through audio devices (not limited to the speaker 170A and the receiver 170B) or displays images or videos through the display screen 194. In some embodiments, the modem processor may be an independent device; in other embodiments, it may be independent of the processor 110 and provided in the same device as the mobile communication module 150 or other functional modules.
The wireless communication module 160 can provide solutions for wireless communication applied on the electronic device 100, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), etc. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. It receives electromagnetic waves via antenna 2, frequency-modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110; it can also receive signals to be sent from the processor 110, frequency-modulate and amplify them, and convert them into electromagnetic waves radiated via antenna 2.
In some embodiments, antenna 1 of the electronic device 100 is coupled with the mobile communication module 150, and antenna 2 is coupled with the wireless communication module 160, so that the electronic device 100 can communicate with networks and other devices through wireless communication technologies.
The electronic device 100 implements the display function through the GPU, the display screen 194, the application processor, etc. The GPU is a microprocessor for image processing, connects the display screen 194 and the application processor, and is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, etc., and includes a display panel. The display panel may use a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Miniled, a MicroLed, a Micro-oLed, quantum dot light-emitting diodes (QLED), etc. In some embodiments, the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
The electronic device 100 can implement the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, etc.
The camera 193 is used to capture still images or videos. An object generates an optical image through the lens, which is projected onto the photosensitive element.
The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example saving files such as music and videos in the external memory card.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. By running the instructions stored in the internal memory 121, the processor 110 executes the various functional applications and data processing of the electronic device 100. The internal memory 121 may include a program storage area and a data storage area, where the program storage area may store the operating system and the applications required for at least one function (such as a sound playback function and an image playback function), and the data storage area may store data created during use of the electronic device 100 (such as audio data and a phone book). In addition, the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, universal flash storage (UFS), etc.
The electronic device 100 can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset jack 170D, the application processor, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and to convert an analog audio input into a digital audio signal; it can also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.
The keys 190 include a power key, volume keys, etc., and may be mechanical keys or touch keys. The electronic device 100 can receive key input and generate key signal input related to the user settings and function control of the electronic device 100.
The motor 191 can generate vibration prompts, and can be used for incoming call vibration prompts as well as touch vibration feedback. For example, touch operations acting on different applications (such as photographing and audio playback) can correspond to different vibration feedback effects, as can touch operations acting on different areas of the display screen 194. Different application scenarios (such as time reminders, receiving messages, alarm clocks, and games) can also correspond to different vibration feedback effects, and the touch vibration feedback effects can be customized.
The indicator 192 may be an indicator light, which can be used to indicate the charging status and battery level changes, and can also be used to indicate messages, missed calls, notifications, etc.
The SIM card interface 195 is used to connect a SIM card. A SIM card can be inserted into or removed from the SIM card interface 195 to achieve contact with and separation from the electronic device 100. The electronic device 100 can support 1 or N SIM card interfaces, where N is a positive integer greater than 1. The SIM card interface 195 can support Nano SIM cards, Micro SIM cards, SIM cards, etc. Multiple cards can be inserted into the same SIM card interface 195 simultaneously, and the multiple cards may be of the same or different types. The SIM card interface 195 is also compatible with different types of SIM cards and with external memory cards. The electronic device 100 interacts with the network through the SIM card to implement functions such as calls and data communication. In some embodiments, the electronic device 100 uses an eSIM, that is, an embedded SIM card; the eSIM card can be embedded in the electronic device 100 and cannot be separated from it.
The software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiments of this application take the Android system with a layered architecture as an example to illustrate the software structure of the electronic device 100.
Figure 2 is a block diagram of the software structure of the electronic device 100 according to an embodiment of this application.
The layered architecture of the electronic device 100 divides the software into several layers, each with a clear role and division of labor; the layers communicate through software interfaces. In some embodiments, the Android system is divided into four layers, from top to bottom: the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
The application layer may include a series of application packages.
As shown in Figure 2, the application packages may include applications such as Camera, Gallery, Calendar, Phone, Maps, Recorder, WLAN, Bluetooth, Music, Video, and Messaging.
The application framework layer provides application programming interfaces (APIs) and programming frameworks for the applications in the application layer, and includes some predefined functions.
As shown in Figure 2, the application framework layer may include a window manager, content providers, a view system, a telephony manager, a resource manager, a notification manager, etc.
The window manager is used to manage window programs; it can obtain the display screen size, determine whether there is a status bar, lock the screen, capture the screen, etc.
Content providers are used to store and retrieve data and make it accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
The view system includes visual controls, such as controls for displaying text and controls for displaying pictures, and can be used to build applications. A display interface can be composed of one or more views; for example, a display interface including a short message notification icon can include a view for displaying text and a view for displaying pictures.
The telephony manager is used to provide the communication functions of the electronic device 100, such as management of call states (including connecting, hanging up, etc.).
The resource manager provides various resources for applications, such as localized strings, icons, pictures, layout files, and video files.
The notification manager enables applications to display notification information in the status bar; it can be used to convey informational messages that disappear automatically after a short stay without user interaction, for example to notify download completion or provide message reminders. The notification manager can also present notifications in the status bar at the top of the system in the form of charts or scroll bar text (such as notifications of applications running in the background) or on the screen in the form of dialog windows, for example prompting text information in the status bar, sounding a prompt tone, vibrating the electronic device, or flashing the indicator light.
The system libraries and runtime layer includes the system libraries and the Android runtime. The system libraries may include multiple functional modules, for example: a surface manager, media libraries, a 3D graphics processing library (for example, OpenGL ES), and a 2D graphics engine (for example, SGL). The 3D graphics library is used to implement 3D graphics drawing, image rendering, composition, layer processing, etc. The Android runtime includes core libraries and a virtual machine, and is responsible for the scheduling and management of the Android system. The core libraries include two parts: functional functions that the Java language needs to call, and the core libraries of Android. The application layer and the application framework layer run in the virtual machine, which executes the Java files of these layers as binary files and performs functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
The surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications.
The media libraries support playback and recording of multiple common audio and video formats, as well as still image files, etc., and can support multiple audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.
The 3D graphics processing library is used to implement 3D graphics drawing, image rendering, composition, layer processing, etc.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is the layer between hardware and software, and contains at least a display driver, a camera driver, an audio driver, and a sensor driver.
It can be understood that the components contained in the system framework layer and the system libraries and runtime layer shown in Figure 2 do not constitute a specific limitation on the electronic device 100. In other embodiments of this application, the electronic device 100 may include more or fewer components than shown, combine some components, split some components, or have a different component arrangement.
In some scenarios, such as lectures, meetings, and interviews, users often need to use the recorder application of an electronic device to record live speech and convert the recording into text. Limited by recording-to-text technology, the text converted by the electronic device may contain inaccurate characters. After the recording ends, when the user modifies the converted text, the recording can be replayed to guide the modification.
In the conventional technology, when an electronic device converts a recording into text, it mainly uploads the recording file to a cloud server; the cloud server uses AI (Artificial Intelligence) technology to convert the recording file into text and records the correspondence between the recording and the text. The cloud server then delivers the transcribed text and the correspondence between the recording and the text to the electronic device.
Exemplarily, taking a mobile phone as the electronic device, as shown in Figure 3, the user can operate the recorder application to play a recording, so that the display interface of the phone is the display interface 101. Exemplarily, the display interface 101 may include one or more controls, including but not limited to: the text 106 converted from the recording, a play pause control 102, a current playback time control 105, a playback duration control 104, a playback progress control 103, etc. Exemplarily, the total duration of the recording is the 1 minute shown by the playback duration control 104, and the double vertical bar icon of the play pause control 102 indicates that the recording is in the playing state.
Exemplarily, the user clicks the character "色" in the text 106 to locate the audio. The phone can respond to the user's click operation and, according to the above correspondence, locate the audio position corresponding to the text segment to which the "色" character belongs; exemplarily, the audio position here is 0 minutes 30 seconds. The phone can move the playback progress control 103 to the 0 minutes 30 seconds position, update the display content of the current playback time control 105 to the current playback time (or playback progress), here 0 minutes 30 seconds, and display the text segment to which the clicked character belongs in bold, darkened, or changed color to distinguish it from the unselected text. The selected text here is "公园中的春色醉人，杜鹃隐藏在芒果树的枝头，成群的画眉像迎亲队似的蹲在杨树的枝头。", and the 0 minutes 30 seconds point of the recording is the start play time of the selected text. In this way, the phone can locate the audio position based on the user's click operation in the text converted from the recording, and play the recording corresponding to the selected text to guide the user in modifying the text.
However, when the conventional solution converts a recording to text, the recording needs to be uploaded to the cloud server after the recording ends. Taking a recording of about 5 minutes as an example, uploading to the cloud takes about 10 seconds, and transcribing it into editable text takes about 1 minute 40 seconds. The longer the recording, the longer the cloud server needs to convert the recording into text, which affects recording-to-text efficiency. Moreover, as can be seen from Figure 3, after the user clicks a position in the text 106, the text segment located by the electronic device is the whole sentence to which the selected character belongs (for example, a sentence ending with a full stop or exclamation mark, which includes multiple text fragments ending with commas), rather than the text fragment "公园中的春色醉人，" to which the selected character "色" belongs. The text length of a whole sentence is generally long, which is not conducive to the user modifying the text based on the located audio.
Therefore, the electronic device of this application provides a recording-to-text method that can convert a recording into text locally on the electronic device and supports forward fixed-point playback and reverse fixed-point playback of the recording, so that the user can, according to the recording played at a fixed point, perform text operations such as adding, deleting, and modifying the corresponding text content. Moreover, the modified text still supports the above forward and reverse fixed-point playback, which can improve recording-to-text efficiency and text proofreading efficiency.
Forward fixed-point playback means: the user selects, in the transcribed text, a text segment whose recording needs to be re-listened to, and the electronic device can adjust the playback progress of the recording to the play time at which that text segment can be played and play the recording.
Reverse fixed-point playback means: after the user adjusts the playback progress of the recording (for example, by dragging the recording's playback progress bar), the electronic device can locate, in the text converted from the recording, the text segment corresponding to the playback progress and play the recording.
Exemplarily, Figure 4a is a flowchart of an exemplary fixed-point playback method. Taking a mobile phone as the electronic device, the flow may include S101, S103, and S105.
S101: during audio-to-text conversion, the phone acquires the original record data of the audio.
Exemplarily, the original record data may include the text length L (for example, the character count) of a text fragment in the text converted from the audio and the timestamp T of the audio segment corresponding to that text fragment. Exemplarily, the timestamp T may be the start play time or the end play time of the audio segment.
It should be noted that the method of this implementation can be applied to audio-to-text scenarios in various languages such as Chinese, English, Japanese, and Korean. This application uses Chinese as an example for description; when applied to other languages, the method is the same and is not repeated here.
In a possible implementation, Figure 4b exemplarily shows the implementation process of S101, and Figure 5 is a schematic diagram of an application scenario of the electronic device.
In this implementation, the phone has an application installed that integrates an audio recording (or collection) function and a text editing function; this application adds a recording-to-text function and a fixed-point playback function to that application.
Exemplarily, the application may be a recorder application (with audio recording, recording-to-text, text editing, and fixed-point playback functions), or an instant messaging application (with audio recording, recording-to-text, text editing, and fixed-point playback functions), etc., which is not limited in this application. The method can also be applied across at least two applications; for example, the recorder application installed on the phone can have the audio recording and recording-to-text functions, and the memo application installed on the phone can have the text editing and fixed-point playback functions, with the recorder application and the memo application interacting to implement the technical solution of this application. The specific implementation details are the same as the method implemented with a single application and are not repeated here.
As shown in Figure 4b, the process may include the following steps:
S200: the phone starts the audio collection and audio-to-text functions of the application.
Exemplarily, as shown in Figure 5(1), the display interface 401 of the phone includes one or more controls, including but not limited to: a battery control, a network control, and application icons. Exemplarily, the user can click the icon 402 of the recorder application in the display interface 401 to start the recorder application. As shown in Figure 5(2), the display interface of the phone switches from the display interface 401 to the display interface 403. Exemplarily, the display interface 403 may include one or more controls, including but not limited to: a control 404 for searching recording files, a control 406 for starting recording, an option control 405 for recording-to-text, etc. The option control 405 includes a switch control 4051. In Figure 5(2), the switch control 4051 is off; in this state, if the user clicks the control 406, the recorder application responds to the user operation by recording, but does not convert the audio recorded in real time into text during recording. Exemplarily, the user clicks the switch control 4051 to switch its state to on; the recorder application can respond to this user operation by setting the switch control 4051 to the on state as shown in Figure 5(3), so that the recorder application starts the audio-to-text function. Then, when the user clicks the start-recording control 406 in Figure 5(3), the recorder application can start the audio collection function.
S201: the application collects audio.
Exemplarily, as shown in Figure 5(3), when the user clicks the control 406, the recorder application can respond to the user operation by starting to record, so as to collect audio. As shown in Figure 5(4), the display interface of the phone switches to the recording interface 501 of the recorder application. Exemplarily, the recording interface 501 may include one or more controls, which may include a recording progress control 506 and a current recording duration control 505; exemplarily, the duration of the recorded audio shown by the current recording duration control 505 is 5 seconds. In addition, the recording interface 501 may include a mark control 502, which can be used to add a mark at a recording node of interest during recording, so that the user can later locate the playback progress of the audio through the mark. The recording interface 501 may also include a pause-recording control 504 for pausing the audio currently being recorded, and an end-recording control 503 for ending the current recording.
S202: the application converts the audio collected in real time into text.
Exemplarily, after the user clicks the start-recording control 406 in Figure 5(3), the recorder application can collect audio in real time and convert the audio collected in real time into text; that is, the recorder application can convert the already recorded audio into text during recording.
Any audio-to-text algorithm may be used for the specific implementation of converting audio into text by the recorder application, which is not limited in this application.
S203: the application determines whether the text is a temporary result.
First, temporary results and final results are explained:
Exemplarily, when the recorder application converts audio recorded in real time into text, it can use the audio-to-text algorithm to output multiple temporary results in sequence (each being text converted from the recording), where a later-output temporary result is a corrected update of an earlier-output temporary result, which can be understood as a correction of the converted text. When an updated temporary result is output, the pre-update temporary result is no longer displayed; that is, at any moment the recorder application outputs and displays only one temporary result. Exemplarily, the recorder application can use the later-output temporary result to refresh the display of the earlier-output temporary result. After refreshing the displayed temporary result multiple times, when the recorder application detects that the semantics of the most recently output temporary result are complete, it can output the most recently output temporary result as the final result (the text converted from one audio segment of the recorded audio). Exemplarily, when outputting the final result, the display mode of the most recently output temporary result can be changed to the display mode corresponding to the final result, to remind the user that the current output result is a final result; as another example, the most recently output temporary result can be refreshed into the final result according to the final result's display mode. Then, the recorder application can clear the data related to the multiple temporary results output in this round and save one final result together with its mapping relationship between L and T (described in detail below). Next, the audio collected in real time continues to be converted into text, so that multiple temporary results and one final result are again output in sequence, and this cycle repeats until the recording ends; at the end of the recording, the recorder application has completed the text conversion of the recorded audio. The text converted from the complete recording is thus composed of multiple final results; correspondingly, the display content of the recorder application interface will be multiple final results, not including temporary results.
In addition, when temporary results and final results are output and displayed on the recording interface of the recorder application, the two kinds of results are displayed differently; for example, the font size of a temporary result is smaller than that of a final result, the text of a temporary result is in italics while that of a final result is upright, etc. The specific display modes are not limited.
For the specific introduction of the temporary results and final results involved in the audio-to-text algorithm, refer to the specific implementation of existing audio-to-text algorithms, which is not repeated in this application.
Exemplarily, a final result is a semantically complete piece of text, which may include one or more punctuation marks; the criterion for judging semantic completeness is determined by the audio-to-text algorithm and is not limited in this application.
Exemplarily, in the process of converting the recorded audio into text, the precision of the temporary results output by the recorder application can be associated with the precision of the audio-to-text algorithm.
Exemplarily, if the precision of the audio-to-text algorithm is a single character, the recorder application generates and outputs a temporary result every time one more character (for example, a word, term, or symbol) of the recorded audio is converted. When the recording-to-text algorithm detects a pause in the audio, it can add a punctuation mark to the converted text and output the text with the added punctuation mark as an updated temporary result.
Exemplarily, suppose the text corresponding to the user's speech is "你好！" ("Hello!"). The output temporary results are, in sequence: "你", "你好", "你好！". The semantics of the last output temporary result "你好！" are complete, so the final result is "你好！", and the text displayed on the recording interface of the recorder application is updated in sequence as: "你", "你好", "你好！".
Exemplarily, during recording, the recorder application can output and display the temporary results and final results obtained in real time on the recording interface 501 shown in Figures 5(4) and 5(5), for example in an area below the current recording duration control 505, which is not limited in this application.
Exemplarily, as shown in Figure 5(4), when the current recording duration is the 0 minutes 5 seconds shown by the current recording duration control 505, the recording interface 501 shows one converted final result and one temporary result output after that final result. The text content of the final result here is the text shown by the larger dashed box in the recording interface 501: "近日，记者在青湖公园看到一群小朋友". The temporary result is the text shown by the smaller dashed box: "在老师". The dashed boxes are only used to indicate the temporary result and the final result; when the recording interface 501 displays either result, the dashed boxes shown here are not displayed. The font size of the temporary result is smaller than that of the final result, and the temporary result is in italic font, so that the user can distinguish which text in the currently output converted text is finally converted text (final results) and which is an intermediate conversion result (temporary results). Exemplarily, when temporary results and final results are displayed on the recording interface 501, they can be displayed in different display modes so that the user can distinguish them. In this way, during recording, the user can browse in real time, on the recording interface, the text content converted in real time from the recorded audio.
In addition, it should be noted that since the precision of the audio-to-text algorithm of this embodiment is a single character, after the final result shown in the recording interface 501 is output, temporary result 0, temporary result 1, and temporary result 2 are output in sequence, where the text of temporary result 0 is "在", the text of temporary result 1 is "在老", and the text of temporary result 2 is "在老师". As described above, when an updated temporary result is output, the earlier-output temporary result is no longer displayed; therefore, only the text content of temporary result 2 is shown in Figure 5(4).
Exemplarily, every time the recorder application updates the temporary result, it refreshes the temporary result displayed in the recording interface 501; when a final result is to be output, the recorder application can update the last output temporary result into the final result. Exemplarily, as shown in Figure 5(5), when the current recording duration is the 1 minute shown by the current recording duration control 505, the recorder application has output multiple final results on the recording interface 501; for the text of the multiple final results, see the text content in the dashed box shown in Figure 5(5), which is not repeated here. In addition, the dashed box in Figure 5(5) is only used to indicate the multiple final results and is not displayed in actual applications.
S204: based on the sentence-delimiting punctuation marks in the temporary results, the application records the timestamp T of each audio segment in real time.
Exemplarily, taking Chinese as an example, sentence-delimiting punctuation marks are preset punctuation marks used to indicate sentence segmentation; for example, they may include but are not limited to: the comma, full stop, exclamation mark, question mark, semicolon, enumeration comma, etc. Symbols such as colons and brackets do not carry sentence-segmenting meaning and need not be treated as sentence-delimiting punctuation marks.
Exemplarily, the timestamp T may be the current recording duration corresponding to the start time point or end time point of an audio segment; for example, if an audio segment is the audio recorded from the 1st second to the 10th second of the current recording, the timestamp T of that audio segment may be 1 second or 10 seconds.
Exemplarily, the timestamp T may also be the system time of the phone (for example, Beijing time) corresponding to the start or end time point of an audio segment. For example, if an audio segment was recorded from 14:00:00 to 14:00:10 Beijing time, its timestamp may be 14:00:00 Beijing time or 14:00:10 Beijing time.
In other embodiments, the timestamp T may also be another type of time information used to identify the recording progress of an audio segment, not limited to the current recording duration or system time exemplified above.
In the embodiments of this application, the output time of each temporary result is the current time of audio collection, so when the recorder application detects that a temporary result includes a sentence-delimiting punctuation mark, it can acquire the current time of audio collection (for example, the system time corresponding to the start or end time point of the above audio segment, or the current recording duration) to obtain the timestamp corresponding to that sentence-delimiting punctuation mark in the temporary result.
Exemplarily, Figure 6a is a schematic diagram of multiple temporary results output in sequence and their timestamps.
In Figure 6a, Pi denotes the i-th sentence-delimiting punctuation mark in a temporary result, and ti denotes the timestamp T of Pi; i is an integer starting from 0, with no limit on its maximum value.
The text of each temporary result in Figure 6a is not shown and is indicated by a line segment; the sentence-delimiting punctuation marks in a temporary result are shown as black dots, with arrows pointing to the dots illustrating the symbol content. The timestamp ti in each temporary result in Figure 6a denotes the timestamp of the i-th sentence-delimiting punctuation mark in that temporary result. Although different subfigures of Figure 6a show the same ti (for example, both Figure 6a(1) and Figure 6a(4) include timestamp t0), the value of t0 may differ between different temporary results; t0 merely denotes the timestamp of the 0th sentence-delimiting punctuation mark in the temporary result it belongs to.
In addition, it should be noted that in the implementations of this application, when objects are ordered, the ordering starts from the 0th, which is the ordering principle of computer languages; in natural-language terms, for example, the 0th sentence-delimiting punctuation mark of the computer-language ordering is the 1st sentence-delimiting punctuation mark.
示例性的,如图6a(1)所示,录音机应用检测到临时结果0中包括一个断句标点符号P0,例如P0对应的符号为逗号。在临时结果0之前还未出现过带断句标点符号的临时结果,则录音机应用可以获取当前录音时长(这里以当前录音时长为例来说明时间戳)t0,并记录P0与t0的映射关系,例如t0=1s,则映射关系包括P0对应于1s。
继续参照图6a(2),录音机应用输出临时结果1,录音机应用检测到临时结果1包括2个断句标点符号(例如P0对应的符号为逗号,P1对应的符号为句号),则录音机应用可检测本次输出的临时结果1相比于上一次输出的临时结果,这里为临时结果0,断句标点符号的数量是否发生了变化,这里断句标点符号的数量得到了增加,则录音机应用可获取当前录音时长t1(例如2s),来作为临时结果中最后一个断句标点符号(这里的P1)的时间戳,并继续记录P1与t1的映射关系,则更新后的映射关系包括P0对应于1s,P1对应于2s。
继续参照图6a(3),录音机应用输出临时结果2,录音机应用检测到临时结果2包括1个断句标点符号(例如P0对应的符号为逗号),则录音机应用可检测本次输出的临时结果2相比于上一次输出的临时结果1,断句标点符号的数量减少。在输出临时结果2之前,记录的关于临时结果的映射关系包括P0对应于t0,P1对应于t1。那么录音机应用可将记录的最后一个标点符号对应的映射关系删除,即这里的P1与t1的映射关系。并获取当前录音时长(例如2.1s),将图6a(2)中示出的上述映射关系中P0对应的t0的取值更新为该当前录音时长,那么如图6a(3)所示,更新后的映射关系包括P0对应于t0(这里为2.1s)。
在图6a(1)变化到图6a(2),或从图6a(2)变化到图6a(3)的场景中,在断句标点符号增加或减少时,均是从上一次临时结果中出现的最后一个断句标点符号的后面开始增加断句标点符号,或是从上一次临时结果中出现的最后一个断句标点符号开始向前减少断句标点符号,以对文本的末尾位置增加或删除标点符号。这种断句标点符号的增减场景,主要是音频转文本算法检测到上一次临时结果的断句错误,从而在下一次输出的临时结果中进行断句位置的纠正。例如从图6a(1)变化到图6a(2)的场景可以是音频转文本算法确定在P0之后的文本中应该增加一个断句标点符号,从而增加了P1。再如从图6a(2)变化到图6a(3)的场景,可以是音频转文本算法确定P1位置不应该出现断句,从而将临时结果1中的P1对应的符号删除,以输出临时结果2。
可选地,在一些场景中,可能还存在本次输出的临时结果中增加或减少的断句标点符号,是在上一次输出的临时结果中最后一个标点符号之前的位置的情况,这种情况主要是语义反复造成的。
示例性的,可参照图6a(4),在录音机应用输出临时结果2之后,输出了对于临时结果2进行纠正的临时结果3。录音机应用检测到临时结果3包括2个断句标点符号(例如感叹号和逗号),则录音机应用可检测本次输出的临时结果3相比于上一次输出的临时结果2,断句标点符号的数量得到了增加。在输出临时结果3之前,记录的关于临时结果的映射关系只有P0对应于t0(这里t0=2.1s)。那么本次临时结果的断句标点符号的数量得到了增加,则录音机应用可获取当前录音时长(例如2.5s),以作为本次临时结果3中最后一个断句标点符号的时间戳t1,以继续记录P1与t1的映射关系,则更新后的映射关系包括P0对应于2.1s,P1对应于2.5s。那么该实施方式中,临时结果3中的最后一个断句标点符号(这里为逗号)的时间戳为2.5s,参照图6a中示出的临时结果2对应的映射关系,逗号对应的时间戳t0为2.1s,但是在临时结果3对应的映射关系中,该逗号的准确时间戳为2.5s,从而导致更新后的临时结果中部分断句标点符号的时间戳存在一定误差。但是,在本次输出的临时结果中增加的断句标点符号所处的位置,是在上一次输出的临时结果中最后一个标点符号之前的位置的情况比较少,该误差不影响整体的音频转文本的时间戳的准确度。
示例性的,可参照图6a(5),在录音机应用输出临时结果3之后,录音机应用继续输出对于临时结果3进行纠正的临时结果4。录音机应用检测到临时结果4包括1个断句标点符号(例如逗号),则录音机应用可检测本次输出的临时结果4相比于上一次输出的临时结果3,断句标点符号的数量得到了减少。在输出临时结果4之前,记录的关于临时结果的映射关系包括P0对应于2.1s,P1对应于2.5s。那么本次临时结果中的断句标点符号的数量得到了减少,则录音机应用可将记录的最后一个标点符号对应的映射关系删除,即删除图6a(4)中的P1与t1(这里为2.5s)的映射关系。并且,录音机应用还可获取当前录音时长(例如2.55s),将图6a(4)中示出的映射关系中P0对应的t0的取值更新为该当前录音时长,那么如图6a(5)所示,更新后的映射关系包括P0对应于t0(这里为2.55s)。
在图6a(3)变化到图6a(4),或从图6a(4)变化到图6a(5)的场景中,主要体现了本次输出的临时结果中增加或减少的断句标点符号所处的位置,是在上一次输出的临时结果中最后一个标点符号之前的位置的情况,这种情况主要是语义反复造成的,在实际应用中,该场景较少。
在图6a(5)之后,录音机应用继续输出下一个临时结果,如图6a(6)所示的临时结果5,录音机应用检测到临时结果5包括2个断句标点符号(例如P0对应的符号为逗号,P1对应的符号为句号),则录音机应用可检测本次输出的临时结果5相比于上一次输出的临时结果4,断句标点符号的数量得到了增加,则录音机应用可获取当前录音时长t1(例如2.6s),来作为临时结果中最后一个断句标点符号(这里的P1)的时间戳,并继续记录P1与t1的映射关系,则更新后的映射关系包括P0对应于2.55s,P1对应于2.6s。
可选地,当录音机应用对比前后两次临时结果的断句标点符号的数量没有发生变化,则不触发时间戳的获取,也不触发对Pi与ti的映射关系的更新。
在本申请实施例中，录音机应用每输出一个临时结果，就可检测该临时结果是否包括断句标点符号，如果包括断句标点符号，则录音机应用可获取断句标点符号的数量。在本次输出的临时结果之前也存在输出的临时结果时，录音机应用在检测到最近两次的临时结果之间，断句标点符号的数量得到更新（例如增加），则可以获取时间戳（例如当前录音时长）以作为本次输出的临时结果中最后一个断句标点符号的时间戳。可选地，可按照断句标点符号的排列次序，来更新断句标点符号对应的时间戳。录音机应用在检测到最近两次的临时结果之间，断句标点符号的数量得到更新（例如减少），则录音机应用同样可获取时间戳（例如当前录音时长），并且，在最近一次记录的断句标点符号的排列次序与时间戳的映射关系中，将排列次序为最后一个的断句标点符号与其时间戳的映射关系删除；删除操作完成后，再将本次获取的时间戳，更新为此刻该映射关系中最后一个断句标点符号（即删除操作前排列次序为倒数第二的断句标点符号）对应的时间戳。
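示例性的，为便于理解上述对Pi与ti的映射关系的记录与更新逻辑，以下给出一段示意性的Kotlin代码草案（类名、函数名与时间单位均为便于说明而假设，并非对本申请具体实现的限定），其中countBreakPunctuations为上文示例中的计数函数：

```kotlin
// 维护一组临时结果中断句标点符号的时间戳列表，timestamps[i] 即 Pi 对应的 ti
class PunctuationTimestamps {
    private val timestamps = mutableListOf<Long>()
    private var lastCount = 0

    // 每输出一个临时结果时调用，now 为获取的当前录音时长（例如毫秒）
    fun onInterimResult(interimText: String, now: Long) {
        val count = countBreakPunctuations(interimText)
        when {
            // 数量增加：将 now 记录为本次临时结果中最后一个断句标点符号的时间戳
            count > lastCount -> timestamps.add(now)
            // 数量减少：删除最后一个映射，并将删除后最后一个映射的时间戳更新为 now
            count < lastCount && timestamps.isNotEmpty() -> {
                timestamps.removeAt(timestamps.size - 1)
                if (timestamps.isNotEmpty()) {
                    timestamps[timestamps.size - 1] = now
                }
            }
            // 数量不变：不触发时间戳的获取与映射关系的更新
        }
        lastCount = count
    }

    // 检测到最终结果时转存当前映射关系，并复位以处理下一组临时结果
    fun takeForFinalResult(): List<Long> {
        val result = timestamps.toList()
        timestamps.clear()
        lastCount = 0
        return result
    }
}
```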
需要说明的是,如前文所述,音频转文本算法的精度为单个字符,即每更新一个字符,就输出一个临时结果,其中,断句标点符号也是一个字符。那么在最近两次临时结果之间,在断句标点符号的数量发生变化时,一般情况下,数量只会增加或减少一个,所以,转换的文本中只要增加或减少一个断句标点符号,录音机应用就可输出一个临时结果。并且,数量增减的断句标点符号在本次临时结果中的位置,普遍位于上一次输出的临时结果中排列次序为最后一个的断句标点符号之前或之后的位置。
还需要说明的是,图6a(1)至图6a(6)旨在体现断句标点符号数量发生更新的各主要场景,在实际应用中,图6a(1)至图6a(6)分别表示的断句标点符号数量增减的场景并不一定会连续的出现,图6a仅作为示例来说明在断句标点符号数量发生更新时,本申请的录音机应用如何更新临时结果中相应断句标点符号的时间戳,而并不用于限制本申请。
继续参照图6a,在图6a(6)之后,录音机应用检测到临时结果5的语义完整,如图6a(7)所示,录音机应用可将临时结果5作为最终结果输出。
S205,应用根据最终结果,记录每个音频片段的时间戳T和该音频片段对应的文本片段的文本长度L。
示例性的,当录音机应用检测到本次输出的文本为最终结果时,则可将最近一次临时结果对应的Pi与ti的映射关系作为最终结果的Pi与ti的映射关系,这里为图6a(6)所示的P0对应于t0(例如2.55s),以及P1对应于t1(例如2.6s)。然后,录音机应用可基于最终结果的Pi与ti的映射关系中的Pi在最终结果中的位置,来记录每个文本片段的字符数量。如图6a(7)所示,录音机应用可计算从最终结果的起始位置到对应于时间戳t0的P0(即自然语言表示的第一个断句标点符号,这里对应的符号为逗号)的字符数量l0,其中,P0对应的逗号可以计数在字符数量l0之内。同理,录音机应用可计算最终结果中在P0之后至P1(包括P1)的字符数量l1,从而生成图6a(8)所示的录音转文本中每个文本片段(或者说每个音频片段)的L与T的映射关系,这里包括l0与t0的映射关系,l1与t1的映射关系。
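示例性的，以下给出依据最终结果与转存的时间戳列表，生成每句话的L与T的映射关系的一段示意性Kotlin代码（数据结构与函数名均为假设的示例，并不用于限制本申请），其中每个断句标点符号计入其所属文本片段的字符数量之内：

```kotlin
// 一条 L 与 T 的映射：charCount 即 li，timestamp 即 ti
data class SentenceRecord(val charCount: Int, val timestamp: Long)

// finalText 为最终结果文本，timestamps 为转存的 ti 列表（按 Pi 的排列次序）
fun buildRecords(finalText: String, timestamps: List<Long>): List<SentenceRecord> {
    val records = mutableListOf<SentenceRecord>()
    var start = 0       // 当前文本片段的起始字符位置
    var punctIndex = 0  // 已处理的断句标点符号个数
    finalText.forEachIndexed { i, ch ->
        if (ch in BREAK_PUNCTUATIONS && punctIndex < timestamps.size) {
            // li 从上一个断句标点符号之后起算，并将本断句标点符号计入在内
            records.add(SentenceRecord(i - start + 1, timestamps[punctIndex]))
            start = i + 1
            punctIndex++
        }
    }
    return records
}
```

例如，对于文本片段“近日，”（含逗号共3个字符），buildRecords生成的charCount为3。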
在图6a(8)中，第一句话表示对音频转换后的文本中，包括一个断句标点符号的第一个文本片段。第二句话表示在第一句话之后包括一个断句标点符号的文本片段。图6b为示例性示出的对音频转换文本后生成的原始记录数据，即每句话的L与T的映射关系。
示例性的,可参照图6b,录制的音频或对音频转换后的文本中,第一句话(例如一个文本片段)的L为l0,即第一句话的字符数量,第一句话对应的音频片段的时间戳为t0,其中,t0为第一句话对应的音频片段的结束时间。
示例性的,可参考图8a(2)中示意的由音频转换的文本603来理解。在原始记录数据中,第一句话为“近日,”,第一句话的字符数量为3,那么l0=3;第二句话为“记者在青湖公园看到一群小朋友在老师的带领下,”,l1=22;第三句话为“用稚嫩的小手画出公园风光,”,l2=13;第四句话为“感受白花盛开风景如画的春天。”,l3=14;第五句话为“公园中的春色醉人,”,l4=9;第六句话为“杜鹃隐藏在芒果树的枝头,”,l5=12;第七句话为“成群的画眉像迎亲队似的蹲在杨树的枝头。”,l6=19。
当然,在其他实施例中,也可以将一个音频片段的T设置为该音频片段的起始时间,例如图6b中第一句话的L为l0,T为0,第二句话的L为l1,T为t0。
本实施例中，录音机应用检测到临时结果中出现断句标点符号时，可实时记录该断句标点符号的时间戳，从而实现对包括该断句标点符号的文本片段的时间戳的记录。当检测到最终结果时，则将对最近一次的临时结果记录的时间戳进行转存，以及记录最终结果中每一句话的字符数量。那么基于临时结果所生成的时间戳是准确的，基于最终结果所生成的字符数量是准确的，从而可以基于临时结果和最终结果，来得到录音转文本中每个音频片段的时间戳T以及对应该音频片段的文本片段的字符数量L的映射关系。录音机应用可在实时转录的过程中，在每次输出最终结果后，将该最终结果对应的ti与li的映射关系持久化存储到本地文件系统。或者，也可以在录音结束后，将各个最终结果对应的ti与li的映射关系持久化存储到本地文件系统，本申请对此不做限制。
在本申请实施例中,手机可在录音转文本的同时生成原始记录数据。其中,手机可根据断句标点符号来对文本片段进行分割,记录每一文本片段(例如每一句话)的时间戳T与字符数量L的映射关系,以生成原始记录数据。该原始记录数据具有时间戳T与字符数量L的信息,可在手机进行定点播放时,作为用于计算定点播放位置的依据,从而实现准确的定点播放。而且,该原始记录数据中不需要对录音转换后的文本增加特殊字符,不会破坏文本内容的段落编排。
另外，本申请实施方式中，录音过程与对录音转换为文本的过程是同步进行的，且音频录制与文字编辑的功能集成在同一个应用中，可实现边录边转，录音结束时也即录音转文本结束时，这样无需在录音结束后再对录音转换为文本，使得录音转文本的效率更高。
需要说明的是，一组临时结果对应于一个最终结果，随着录音转文本的进度的变化，可得到多组临时结果以及对应的多个最终结果，不同组临时结果之间的数据相互独立，同理，不同最终结果之间的数据相互独立。那么在对比两次临时结果之间断句标点符号的数量是否发生变化时，是对对应于同一个最终结果的一组内的临时结果进行对比，而无需与其他组的临时结果进行对比。
S206,应用判断是否接收到停止采集音频的操作。
示例性的,在每次输出一个临时结果,或每次输出一个最终结果之后,录音机应用可判断录音是否结束,如果录音没有结束,则转至S201继续循环执行以上步骤,直至录音结束。
示例性的，可参照图5(3)至图5(5)，从用户点击控件406开始，直至用户点击结束录音控件503，录音机应用可循环执行图4b所示的S201至S206。
示例性的,在录音机应用录制到如图5(5)中当前录音时长控件505示出的1分钟时,用户点击结束录音控件503,则录音机应用可接收到停止采集音频的操作,从而转至执行S207。
S207,停止音频采集。
示例性的,如图5(6)所示,在停止录音之后,手机的显示界面切换为显示界面403,显示界面403可包括一个或多个控件,该控件可包括录音结果控件406,录音结果控件406可包括录音名称控件(这里的录音名称为“录音1”),录音时间控件(这里为2022年3月1日),以及播放录音控件4061。
示例性的，如图8a(1)所示，用户点击播放录音控件4061可播放该录音，以显示图8a(2)所示的显示界面601，显示界面601包括一个或多个控件。该控件可包括图8a(1)示出的录音1在实时录制过程中实时转换的文本603，以及播放进度条控件602，播放进度条控件602包括播放进度条6023、播放进度控件6024、播放暂停控件6025、当前播放时间控件6021、音频时长控件6022。录音1的音频时长为音频时长控件6022示出的1分钟，该录音1的当前播放进度为当前播放时间控件6021示出的0分0秒。如播放暂停控件6025中的双竖线图标所示，该录音1当前处于播放状态。
示例性的,在S207之后,继续回到图4a,在S101之后,可继续执行S103。
S103,手机响应于对音频或该音频的文本的定点播放操作,根据所述原始记录数据,确定至少一个定点播放位置。
该定点播放操作可分为正向定点播放和反向定点播放,下面结合这两种场景,分别进行描述:
方式1:正向定点播放
示例性的,可参照图8a(2),用户可在文本603中点击某个位置,或选中半句话,或一句话,或连续的多句话,这里以包括一个断句标点符号的文本来定义一句话(一句话的定义具体可参照图6b的解释说明)。手机可响应于该用户操作,依据原始记录数据(例如图6b所示的原始记录数据)来确定定点播放位置,例如用户点击文本对应的是原始记录数据中的第几句话。
实施方式1、用户点击文本中的单个位置
1)录音机应用获取该位置在显示的文本(例如图8a(2)示出的文本603)中的坐标Q(x,y)。
2)录音机应用根据坐标Q(x,y),获取上述位置之前的文本的总字数offsetCount。
3)录音机应用依据原始记录数据，按照原始记录数据中各句话的次序，循环计算至少一句话的总字数totalCount(i)。例如，首先依据第一句话（标识为index(0)）的l0，计算totalCount(0)=l0；然后，计算第一句话以及第二句话（标识为index(1)）的总字符数量，得到totalCount(1)=l0+l1；然后，计算第一句话、第二句话、第三句话（标识为index(2)）的总字符数量，得到totalCount(2)=l0+l1+l2；……最多计算到原始记录数据中的全部语句的总字符数量，得到totalCount(n-1)=l0+l1+l2+l3+……+l(n-1)。例如图6b所示的原始记录数据，一共有n句话的L与T。在上述循环计算过程中，每计算一次totalCount(i)，录音机应用可判断当前计算的totalCount(i)是否大于或等于上述offsetCount。如果当前计算的totalCount(i)小于上述offsetCount，则继续进行下一次totalCount(i+1)的计算。如果当前计算的totalCount(i)大于或等于上述offsetCount，则确定当前计算的totalCount(i)中遍历到的语句index(i)为定点播放位置。
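示例性的，以下给出上述循环计算过程的一段示意性Kotlin代码（函数名为假设的示例，records为按次序排列的各句话的L与T，即原始记录数据），以便于理解如何由offsetCount确定定点播放位置index(i)：

```kotlin
// 依据点击位置之前的总字数 offsetCount，确定定点播放位置对应的语句下标 i
fun locateSentence(offsetCount: Int, records: List<SentenceRecord>): Int {
    var totalCount = 0
    records.forEachIndexed { i, record ->
        totalCount += record.charCount          // totalCount(i) = l0 + l1 + ... + li
        if (totalCount >= offsetCount) return i // totalCount(i) >= offsetCount 时即为 index(i)
    }
    return records.size - 1 // offsetCount 超出全部语句总字数时，落在最后一句话
}
```

其中，offsetCount可由文本控件依据点击坐标换算得到，具体换算方式取决于文本视图的实现，这里不做限制。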
例如i=3,则录音机应用可确定用户点击位置在图6b所示的第四句话中,其中,第四句话的标识为index(3)。
实施方式2、用户选中由音频转换成的文本中的半句话,或一句话,或连续的多句话
1)录音机应用可获取用户选中的目标文本中起始位置和结束位置,在由音频转换成的文本(例如图8a(2)示出的文本603)中分别对应的坐标Q1(x,y)和坐标Q2(x,y)。
2)录音机应用根据坐标Q1(x,y),获取起始位置之前的文本的总字数offsetCount1;根据坐标Q2(x,y),获取终止位置之前的文本的总字数offsetCount2。
本步骤的原理,与实施方式1中的步骤2)的原理类似,这里不再赘述。
3)录音机应用依据原始记录数据,基于offsetCount1,确定起始位置在原始记录数据中对应的定点播放位置1(例如index(i));以及依据原始记录数据,基于offsetCount2,确定终止位置在原始记录数据中对应的定点播放位置2(例如index(j))。
本步骤的原理,与实施方式1中的步骤3)的原理类似,这里不再赘述。那么录音机应用可以确定起始语句与结束语句在原始记录数据中的标识index,从而确定起始位置属于由录音转换成的文本中的第i+1句话,以及终止位置属于该文本中的第j+1句话。
方式2:反向定点播放
示例性的,可参照图8a(2),用户可在播放进度条6023中单击某个位置,或者将播放进度控件6024拖动到该位置,来改变录音的播放进度。手机可响应于该用户操作,依据原始记录数据(例如图6b所示的原始记录数据)来确定定点播放位置,例如用户在播放进度条6023中选中的音频播放位置,对应的是原始记录数据中的第几句话。
实施方式3
1)录音机应用获取用户对录音1调整后的播放进度对应的当前播放时间progressTime。
例如,用户在图8a(2)示出的播放进度条6023中单击某个位置,或者将播放进度控件6024拖动到该位置。录音机应用可响应于该用户操作来获取该位置对应的当前播放时间progressTime。
2)录音机应用依据原始记录数据，按照原始记录数据中各句话的次序，遍历各句话的ti，其中，ti为第i+1句话的结束播放时间。示例性的，如图6b所示，i可从0开始，i的最大值为(n-1)。i=0时，录音机应用可首先判断progressTime是否在0至t0之间（不包括0，包括t0）；如果progressTime没有在0至t0之间，则i=1，录音机应用继续判断progressTime是否在t0至t1之间（不包括t0，包括t1）；如果progressTime没有在t0至t1之间，则i=2，录音机应用继续判断progressTime是否在t1至t2之间（不包括t1，包括t2）……如此循环，直至检测到progressTime在t(i-1)至ti之间（不包括t(i-1)，包括ti），其中，如图6b所示，原始记录数据对应n句话，则i的最大值为(n-1)。例如i=3，progressTime属于t2与t3之间，则可以确定index(i)=index(3)，其中，index(3)为第四句话的标识，可以确定定点播放位置为第四句话。
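示例性的，以下给出上述遍历判断过程的一段示意性Kotlin代码（函数名为假设的示例）。由于按各句话的次序遍历，第一个满足progressTime小于或等于ti的语句，即为progressTime落在(t(i-1), ti]区间内的语句：

```kotlin
// 依据调整后的当前播放时间 progressTime，确定定点播放位置对应的语句下标 i
fun locateByTime(progressTime: Long, records: List<SentenceRecord>): Int {
    records.forEachIndexed { i, record ->
        // 按次序遍历时，首个 ti >= progressTime 的语句即满足 t(i-1) < progressTime <= ti
        if (progressTime <= record.timestamp) return i
    }
    return records.size - 1 // progressTime 晚于全部 ti 时，落在最后一句话
}
```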
S105,手机基于至少一个定点播放位置和所述原始记录数据,更新音频的播放进度,在文本中标识目标文本片段。
方式1:正向定点播放
在实施方式1中,在用户单击文本中的某个位置时,该至少一个定点播放位置为在原始记录数据中确定的一句话的标识index(i);
那么录音机应用可依据原始记录数据和该标识index(i),确定第i+1句话的li和ti,如图6b所示,例如i=3,则定位到第四句话,L=l3,T=t3,其中,t3为第四句话对应的音频片段的结束时间(例如结束播放时间),如图6b所示的t2为第四句话对应的音频片段的起始时间(例如开始播放时间)。
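示例性的，以下给出由定位到的语句下标i计算对应音频片段的开始播放时间，以及对应文本片段在全文中的字符区间的一段示意性Kotlin代码（函数名为假设的示例，字符区间可用于后续以预设显示方式显示目标文本片段）：

```kotlin
// 第 i+1 句话对应的音频片段的开始播放时间：i 为 0 时从 0 开始，否则为 t(i-1)
fun segmentStartTime(i: Int, records: List<SentenceRecord>): Long =
    if (i == 0) 0L else records[i - 1].timestamp

// 第 i+1 句话对应的文本片段在全文中的字符区间（前 i 句话的总字数为区间起点）
fun segmentCharRange(i: Int, records: List<SentenceRecord>): IntRange {
    val start = records.take(i).sumOf { it.charCount }
    return start until (start + records[i].charCount)
}
```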
示例性的,可参考图8a(2),手机可将播放进度控件6024沿播放进度条6023移动,播放进度控件6024移动后的位置对应的当前播放时间为t2,当前播放时间控件6021示出的时间将更新为t2。此外,手机还可将文本603中的第四句话以预设显示方式进行显示。该预设显示方式可区别于文本603中除该第四句话之外的其他文本的显示方式。这里的目标文本片段(例如第四句话)为“感受白花盛开风景如画的春天。”
需要说明的是,预设显示方式与第四句话的原显示方式之间可以是字号、字体、字体颜色、字体阴影、字体背景色等至少一种显示方式的区别,本申请对此不做限制。例如预设显示方式可以是字体背景色为蓝色。
在实施方式2中,在用户选中一段文本时,该至少一个定点播放位置包括标识index(i)和标识index(j)。
那么录音机应用可依据原始记录数据和该标识index(i),确定第i+1句话的li和ti,如图6b所示,例如i=2,则定位到第三句话,L=l2,T=t2,其中,t2为第三句话对应的音频片段的结束时间(例如结束播放时间),如图6b所示的t1为第三句话对应的音频片段的起始时间(例如开始播放时间)。
那么录音机应用可依据原始记录数据和该标识index(j),确定第j+1句话的lj和tj,如图6b所示,例如j=3,则定位到第四句话,L=l3,T=t3,其中,t3为第四句话对应的音频片段的结束时间(例如结束播放时间),如图6b所示的t2为第四句话对应的音频片段的起始时间(例如开始播放时间)。
示例性的，可参考图8a(2)，手机可将播放进度控件6024沿播放进度条6023移动，播放进度控件6024移动后的位置对应的当前播放时间为t1，当前播放时间控件6021示出的时间将更新为t1。此外，手机还可将文本603中的第三句话以及第四句话以预设显示方式进行显示，这里的第三句话包括“用稚嫩的小手画出公园风光，”，第四句话包括“感受白花盛开风景如画的春天。”。该预设显示方式可区别于文本603中除该第三句话和第四句话之外的其他文本的显示方式。
方式2:反向定点播放
在上述实施方式3中,在手机执行S105时,其实现方式与上述实施方式1的实现方式相同,具体可参考实施方式1中,在实现S105时的具体描述,这里不再赘述。
本申请实施例中，手机可在实时录音的过程中，对录制的音频进行实时文本的转换，并实时生成上述原始记录数据，原始记录数据可包括每句话的L（字符数量）与时间戳T（例如起始播放时间，或结束播放时间）。在对录音转文本之后，用户在手机上发现转换后的文本中存在语义不通顺等需要修改的内容时，用户可对转换后的文本进行定点操作，或对录音的播放进度条进行定点操作，手机可依据定点操作的位置，结合原始记录数据，来确定定点播放位置，例如用户定点操作的位置属于原始记录数据中的第几句话。从而结合原始记录数据中该句话的L与T以及上一句话的L与T，来对音频进行定点播放，以及对定位的文本进行突出显示，实现对录制的音频的正向定点播放和反向定点播放。
示例性的,上述实施方式中的录音机应用可包括音频控制模块、文本视图模块、定点播放模块,图7a为示例性示出的正向定点播放场景下,该录音机应用的数据处理过程的示意图,可结合图6b以及图8a至图8f的场景示意图来理解。
如图7a所示,该过程可包括如下步骤:
S301,音频控制模块获取录音的原始记录数据。
示例性的,本步骤的具体实现过程可参考图4b、图5、及图6a和图6b的相关实施例的描述,这里不再赘述。
S303,文本视图模块响应于对录音转换的文本的定点播放操作,根据点击位置的坐标获取点击位置之前的文本总字数。
示例性的,在手机显示图8a(2)示出的显示界面601以对录音进行播放后,在录音播放一段时间后,手机可显示图8b(1)所示的显示界面601,以使当前播放时间控件6021示出的当前播放时间更新为0分5秒,且播放进度控件6024在播放进度条6023上的位置发生移动。
在一种可能的实施方式中,用户可对文本603进行单击操作,那么文本视图模块在执行S303时,可通过上述实施方式1来实现,具体实现过程可参照对实施方式1的介绍。
示例性的，用户阅读图8b(1)中示出的文本603，发现语句不通顺的地方，单指点击该位置，例如用户点击文本603中的位置1，位置1对应的文本为“白”，那么该用户点击操作可产生点击事件，文本视图模块可处理该点击事件，以获取该位置1的坐标(x1,y1)。然后，文本视图模块可根据坐标(x1,y1)来获取文本603中位于位置1之前的文本的总字数offsetCount。
在另一种可能的实施方式中，用户可对文本603选中半句话、一句话，或连续的多句话，文本视图模块在执行S303时，可通过上述实施方式2来实现，具体实现过程可参照对实施方式2的介绍。
示例性的,图8c为示例性示出的用户对文本603选中半句话的场景的示意图。另外,用户对文本603选中一句话的场景的实现过程与图8c的场景的过程类似,这里不再赘述。
在手机显示图8a之后,随着录音的播放进度的增加,手机可显示如图8c(1)所示的显示界面601,用户可选中文本1(这里为“盛开风景如画”),那么文本1中的起始字符的位置为字符“盛”所处的位置2,终止字符的位置为字符“画”所处的位置3。那么该用户选中文本的操作可产生两个点击事件,分别是对位置2的点击事件和对位置3的点击事件。那么文本视图模块可处理该两个点击事件,以获取该位置2的坐标(x2,y2),以及位置3的坐标(x3,y3)。然后,文本视图模块可根据坐标(x2,y2)来获取文本603中位于位置2之前的文本的总字数offsetCount1,以及根据坐标(x3,y3)来获取文本603中位于位置3之前的文本的总字数offsetCount2。
示例性的,图8d为示例性示出的用户对文本603选中连续的多句话的场景的示意图。
示例性的,在手机显示图8a之后,随着录音的播放进度的增加,手机可显示如图8d(1)所示的显示界面601,用户可选中文本2(具体文本内容参照图8d(1)示出的文本2,这里不再赘述),那么文本2中的起始字符的位置为字符“小”所处的位置4,终止字符的位置为字符“的”所处的位置5。那么该用户选中文本的操作可产生两个点击事件,分别是对位置4的点击事件和对位置5的点击事件。那么文本视图模块可处理该两个点击事件,以获取该位置4的坐标(x4,y4),以及位置5的坐标(x5,y5)。然后,文本视图模块可根据坐标(x4,y4)来获取文本603中位于位置4之前的文本的总字数offsetCount3,以及根据坐标(x5,y5)来获取文本603中位于位置5之前的文本的总字数offsetCount4。
S305,定点播放模块基于上述文本总字数和原始记录数据进行定点计算,确定至少一个定点播放位置。
在一种可能的实施方式中,在用户对图8b中的文本603单击某个位置时,定点播放模块在执行S305时,可通过上述实施方式1来实现,具体实现过程可参照对实施方式1的介绍,这里不再赘述。
示例性的,在图8b的场景中,定点播放模块可确定图8b(1)中的位置1属于文本603中的第四句话,即定点播放位置为index(3),其中,index(3)为第四句话的标识。
在另一种可能的实施方式中,用户可对文本603选中半句话,一句话,或连续的多句话,定点播放模块在执行S305时,可通过上述实施方式2来实现,具体实现过程可参照对实施方式2的介绍。
示例性的,在图8c的场景中,定点播放模块可基于图8c(1)场景的示例中提及的总字数offsetCount1和总字数offsetCount2,来确定图8c(1)中的位置2,以及位置3均属于文本603中的第四句话,即定点播放位置为index(3),其中,index(3)为第四句话的标识。
示例性的，在图8d的场景中，定点播放模块可基于图8d(1)场景的示例中提及的总字数offsetCount3，来确定图8d(1)中的位置4，属于文本603中的第二句话，那么一个定点播放位置为index(1)，index(1)为第二句话在原始记录数据中的标识。以及定点播放模块可基于图8d(1)场景的示例中提及的总字数offsetCount4，来确定图8d(1)中的位置5，属于文本603中的第五句话，那么另一个定点播放位置为index(4)，index(4)为第五句话在原始记录数据中的标识。
S307,定点播放模块更新播放进度以及在文本中标识目标文本片段。
在一种可能的实施方式中,在用户对文本603单击某个位置时,定点播放模块在执行S307时,可通过上述实施方式1来实现。
示例性的,在图8b的场景中,用户点击文本603中的某个位置,定点播放模块可依据原始记录数据和图8b(1)示出的位置1,确定单击位置在该录音1的原始记录数据中对应的语句的标识,这里为index(3)。
示例性的,结合图6b,参照图8b(2),标识index(3)用于标识文本603中第四句话,定点播放模块可从原始记录数据中获取第四句话的L与T,第四句话对应的L为l3,对应的T为t3,其中,t3为第四句话对应的音频片段的结束时间(例如结束播放时间),此外,如图6b所示的t2为第四句话对应的音频片段的起始时间(例如开始播放时间),这里的t2为当前播放时间控件6021示出的0分20秒。
示例性的,可参考手机的显示界面从图8b(1)变化为图8b(2)所示的显示界面,定点播放模块可将播放进度控件6024沿播放进度条6023移动,播放进度控件6024移动后的位置对应的当前播放时间为0分20秒,当前播放时间控件6021示出的时间从图8b(1)所示的0分5秒,更新为文本603中的第四句话的开始播放时间t2,这里为0分20秒。
示例性的,定点播放模块还可基于原始记录数据中记录的第一句话至第四句话中每句话的字符数量L,来确定文本603中对应于第四句话的目标文本片段,并将该目标文本片段以加粗斜体的方式显示在显示界面601,以区别于文本603中其他未被选中的文本的显示方式。这里的目标文本片段(例如第四句话)为“感受白花盛开风景如画的春天。”,以提醒用户待定点播放的文本内容。
在一些实施例中,定点播放模块在响应于用户的定点播放操作时,不仅可以更新播放进度,以及标识选中的目标文本片段,还可将录音的播放状态设置为暂停播放状态,如图8b(2)中的播放暂停控件6025示出的三角形的图标,该图标用于表示录音处于暂停播放状态。而图8b(1)中播放暂停控件6025示出的双竖线的图标,则用于表示录音处于播放状态。这样,用户可根据需要而灵活的选择播放目标文本片段的时机。
在另一种可能的实施方式中,用户可对图8c或图8d所示的文本603选中半句话,一句话,或连续的多句话,定点播放模块在执行S307时,可通过上述实施方式2来实现。
示例性的,在图8c的场景中,用户选中文本603中的半句话,定点播放模块可依据原始记录数据和图8c(1)示出的位置2,确定选中文本对应的起始语句在原始记录数据中的标识,这里为index(3)。以及定点播放模块可依据原始记录数据和图8c(1)示出的位置3,确定选中文本对应的终止语句在原始记录数据中的标识,这里的标识也为index(3),说明用户选中了一句话。
示例性的，结合图6b，参照图8c(2)，标识index(3)用于标识文本603中第四句话，定点播放模块可从原始记录数据中获取第四句话的L与T，第四句话对应的L为l3，对应的T为t3，其中，t3为第四句话对应的音频片段的结束时间（例如结束播放时间），此外，如图6b所示的t2为第四句话对应的音频片段的起始时间（例如开始播放时间），这里的t2为当前播放时间控件6021示出的0分20秒。
示例性的,可参考手机的显示界面从图8c(1)变化为图8c(2)所示的显示界面,定点播放模块可将播放进度控件6024沿播放进度条6023移动,播放进度控件6024移动后的位置对应的当前播放时间为0分20秒,当前播放时间控件6021示出的时间从图8c(1)所示的0分5秒,更新为文本603中的第四句话的开始播放时间t2,这里为0分20秒。
示例性的,如图8c(2)所示,定点播放模块还可基于原始记录数据中记录的第一句话至第四句话中每句话的字符数量L,来确定文本603中对应于第四句话的目标文本片段,并将该目标文本片段以加粗斜体的方式显示在显示界面601,以区别于文本603中其他未被选中的文本的显示方式。这里的目标文本片段(例如第四句话)为“感受白花盛开风景如画的春天。”,以提醒用户待定点播放的文本内容。
示例性的,在图8d的场景中,用户选中文本603中连续的多句话,定点播放模块可依据原始记录数据和图8d(1)示出的位置4,确定多句话中起始语句的标识index(1)。以及定点播放模块可依据原始记录数据和图8d(1)示出的位置5,确定多句话中终止语句的标识index(4)。
结合图6b,参照图8d(2),标识index(1)用于标识文本603中第二句话,定点播放模块可从原始记录数据中获取第二句话的L与T,第二句话对应的L为l1,对应的T为t1,其中,t1为第二句话对应的音频片段的结束时间(例如结束播放时间),此外,如图6b所示的t0为第二句话对应的音频片段的起始时间(例如开始播放时间),这里的t0为0分2秒。以及结合图6b,参照图8d(2),标识index(4)用于标识文本603中第五句话,定点播放模块可从原始记录数据中获取第五句话的L与T,第五句话对应的L为l4,对应的T为t4,其中,t4为第五句话对应的音频片段的结束时间(例如结束播放时间)。
示例性的,可参考手机的显示界面从图8d(1)变化为图8d(2)所示的显示界面,定点播放模块可将播放进度控件6024沿播放进度条6023移动,播放进度控件6024移动后的位置对应的当前播放时间为0分2秒,当前播放时间控件6021示出的时间从图8d(1)所示的0分5秒,更新为文本603中的第二句话(即选中的多句话中的起始语句)的开始播放时间t0,这里为0分2秒。
示例性的,如图8d(2)所示,定点播放模块还可基于原始记录数据中记录的第一句话至第五句话中每句话的字符数量L,来确定文本603中对应于用户选择的多句话中的起始语句(这里为第二句话)对应的目标文本片段,以及终止语句(这里为第五句话)对应的目标文本片段,以及位于起始语句和终止语句之间的第三句话、第四句话各自对应的目标文本片段。并将文本603中的第二句话至第五句话(均为目标文本片段)以加粗斜体的方式显示在显示界面601,以区别于文本603中其他未被选中的文本的显示方式,以提醒用户待定点播放的文本内容。
可选地,S309,音频控制模块接收播放操作。
示例性的，如图8b(2)、图8c(2)、图8d(2)所示，用户可点击上述三个附图中任意一个附图示出的播放暂停控件6025，那么音频控制模块可接收到播放操作。当然，该播放操作的触发方式并不限于这里举例的点击播放暂停控件6025。
示例性的,录音机应用也可以在S307之后,自动播放定位的目标音频片段,而无需用户触发播放操作。
例如,在图8d示出的用户在文本603中选中连续的多句话的场景中,在手机显示图8d(1)之后,用户在文本603中选中文本2,然后,手机可将显示界面切换为图8d(3),无需用户点击播放暂停控件6025,即可实现对选中的多句话的自动播放。
S310,定点播放模块根据实时播放的目标音频片段,在文本中标识对应的目标文本片段。
在一种可能的实施方式中,结合于上述实施方式1,在图8b的场景中,用户单击文本603中的某个位置。示例性的,如图8b(2)所示,用户点击播放暂停控件6025,录音机应用可响应于该用户操作,从播放时间为0分20秒(这里为通过单击位置定位的一句话的开始播放时间)的位置处开始继续播放录音。
示例性的,如图8b(3)和图8b(4)所示,作为目标音频片段的第四句话“感受白花盛开风景如画的春天。”的开始播放时间为t2,t2的取值为图8b(3)中当前播放时间控件6021示出的0分20秒,第四句话的结束播放时间为t3,t3的取值为图8b(4)中当前播放时间控件6021示出的0分25秒。其中,第五句话的开始播放时间t3的取值也为0分25秒。其中,第五句话为“公园中的春色醉人,”。
示例性的,定点播放模块可响应于图8b(2)中用户点击播放暂停控件6025的操作,而显示如图8b(3)所示的显示界面601,录音机应用可从文本603中的第四句话的开始播放时间(这里为0分20秒)开始播放第四句话的录音,并且,将文本603中的第四句话以加粗斜体的方式显示。
示例性的，在录音的当前播放进度为目标音频片段的播放结束时间（例如第四句话的播放结束时间为0分25秒）时，如图8b(4)所示，定点播放模块可将该第四句话的显示方式恢复为原显示方式（显示效果可参照图8b(4)中第四句话的显示方式）。并且，定点播放模块可将待播放的下一个目标音频片段（这里的第五句话）的显示方式设置为预设显示方式（例如加粗斜体的显示方式），同理，在第五句话播放结束后，将第五句话的显示方式恢复为原显示方式，将第六句话的显示方式设置为预设显示方式……直至文本603对应的录音播放结束，即当前播放时间控件6021示出的当前播放时间为1分钟。
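示例性的，以下给出上述自动定点播放过程中，随播放进度在句子边界切换突出显示的一段示意性Kotlin代码（类名与高亮回调均为假设的示例，locateByTime与segmentCharRange为上文示例中的函数）：

```kotlin
// 随播放进度自动切换目标文本片段的突出显示
class AutoHighlighter(
    private val records: List<SentenceRecord>,
    private val highlight: (IntRange) -> Unit, // 将字符区间设置为预设显示方式
    private val restore: (IntRange) -> Unit    // 将字符区间恢复为原显示方式
) {
    private var lastIndex = -1

    // 播放进度回调：progressTime 为当前播放时间
    fun onProgress(progressTime: Long) {
        val i = locateByTime(progressTime, records)
        if (i != lastIndex) {
            if (lastIndex >= 0) restore(segmentCharRange(lastIndex, records))
            highlight(segmentCharRange(i, records))
            lastIndex = i
        }
    }
}
```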
在本实施例中，用户可通过对转换后的文本中的任意一个位置进行单击操作，手机就可基于单击操作所对应的位置，来对单击位置所属的目标音频片段进行播放，以及将单击位置所属的目标文本片段以预设显示方式显示。并且，在该目标文本片段播放结束后，将该目标文本片段的显示方式恢复为原显示方式，以及手机还可自动将下一句话对应的音频片段作为目标音频片段播放，以及将下一句话作为目标文本片段，以预设显示方式显示。那么用户只需要对转换后的文本中的任意一个位置进行单击操作，录音机应用就可以自动定位用户选中的目标文本片段之后的目标文本片段，实现后续每一句话的自动定点播放，以便于用户随时检查后续每句话是否存在文本转换错误的情况，利于用户纠正文本错误，提升文本编辑效率。
在另一种可能的实施方式中,结合于上述实施方式2,用户可对图8c或图8d所示的文本603选中半句话,一句话,或连续的多句话。
示例性的,当用户在对录音转换后的文本(例如图8c所示的文本603)中选中半句话或一句话时,定点播放模块在对定位的一句话进行播放时,可进行自动定点播放,自动定点播放的过程可参照对图8b(3)和图8b(4)的相关描述,这里不再赘述。
示例性的,在图8c示例的用户在文本603中选中半句话或一句话的场景,如图8c(2)所示,用户点击播放暂停控件6025,录音机应用也可以响应于该用户操作,从播放时间为0分20秒的位置处开始继续播放录音,并将文本603中的第四句话以预设显示方式进行显示,在该第四句话播放结束后,后续播放过程可参考图8b(3)和图8b(4)的实施例的相关描述,这里不再赘述,这样可以实现用户进行一次正向定点播放操作,录音机应用可对该正向定点播放操作对应的目标文本片段之后的每句话进行自动定点播放。
示例性的,在图8d示例的用户在文本603中选中连续的多句话的场景,如图8d(2)所示,用户点击播放暂停控件6025,录音机应用也可以响应于该用户操作,如图8d(3)所示,从当前播放时间控件6021示出的0分2秒的位置处开始继续播放选中的第二句话到第五句话的音频片段,并将文本603中的第二句话至第五句话以预设显示方式进行显示,在该第五句话播放结束后,后续播放过程可参考图8b(3)和图8b(4)的实施例的相关描述,这里不再赘述,这样可以实现用户进行一次正向定点播放操作,录音机应用可对该正向定点播放操作对应的目标文本片段之后的每句话进行自动定点播放。
可选地,在S309之后,定点播放模块也可以不执行S310,而是仅对用户在文本603中选中的文本所对应的目标文本片段以及目标音频片段进行定点播放。
示例性的,结合于上述实施方式1,如图8b(3)所示,可播放文本603中的第四句话(这里为“感受白花盛开风景如画的春天。”)的音频片段,在第四句话的音频片段播放完成之后,则自动暂停播放该录音1,以使播放暂停控件6025显示三角形的图标,以表示该录音1处于暂停播放状态。
示例性的,结合于上述实施方式2,在图8c示例的用户在文本603中选中半句话或一句话的场景中,如图8c(2)所示,用户点击播放暂停控件6025,录音机应用也可以响应于该用户操作,从播放时间为0分20秒的位置处开始继续播放录音,并将文本603中的第四句话以预设显示方式进行显示,在该第四句话的音频片段播放结束后,则自动暂停播放该录音1,以使播放暂停控件6025显示三角形的图标,以表示该录音1处于暂停播放状态。并且,录音机应用还可将当前播放时间控件6021示出时间为第四句话的结束播放时间,即t3。
示例性的，结合于上述实施方式2，在图8d示例的用户在文本603中选中多句话的场景中，如图8d(2)所示，用户点击播放暂停控件6025，录音机应用也可以响应于该用户操作，如图8d(3)所示，从当前播放时间控件6021示出的0分2秒的位置处开始继续播放选中的第二句话到第五句话的音频片段，并将文本603中的第二句话至第五句话以预设显示方式进行显示。在该第五句话播放结束时，如图8d(4)所示，录音机应用可自动暂停播放该录音1，以使播放暂停控件6025显示三角形的图标，以表示该录音1处于暂停播放状态。并且，当前播放时间控件6021示出第五句话的播放结束时间t4，这里为0分30秒。
在一种可能的实施方式中，当用户对转换后的文本进行单击操作，以实现正向定点播放时，例如在图8b的场景下，手机可以自动定点播放的方式，在单击位置对应的目标音频片段播放结束后，自动播放下一个音频片段，以及将下一个音频片段对应的文本片段以预设显示方式显示，如此循环，直至整段录音播放结束。该实施方式中，当用户需要自动定点播放时，可对转换后的文本中的任意位置进行单击，便于在自动定点播放场景下的用户操作。
在另一种可能的实施方式中，在用户对转换后的文本中的至少两个字符（例如半句话、一句话、连续的多句话）进行选中的场景下，手机在进行正向定点播放时，可只对用户选中的文本所属的目标音频片段进行播放，以及对用户选中的文本所属的目标文本片段以预设显示方式显示，在目标音频片段播放结束后，手机不再继续定点播放位于该目标音频片段之后的音频片段，手机可自动暂停播放该录音，或者，手机可循环播放该目标音频片段，以及在循环播放过程中，使该目标文本片段一直以预设显示方式显示。在该实施方式中，当用户在转换后的文本中选中至少两个字符时，说明用户可能当前只对该至少两个字符对应的目标文本片段感兴趣而需要进行修改，则手机可只对选中字符所属的目标音频片段及目标文本片段进行播放或突出显示，从而便于用户收听播放的目标音频片段，来纠正目标文本片段中的字符。
可选地,S311,音频控制模块接收暂停播放操作。
示例性的,当用户需要对图8a至图8d中任意附图对应场景下所示出的文本603进行修改时,如果录音1处于播放状态,则用户可点击图8a至图8d中任意附图示出的播放暂停控件6025,以对录音1进行暂停播放。
可选地，可在S310之后，执行S311。例如当用户重听正向定点播放的目标音频片段后，确定该目标音频片段对应的目标文本片段中存在需要修改的地方时，用户可触发上述暂停播放操作，以暂停录音1的播放。
S312,文本视图模块编辑文本。
示例性的,用户可根据播放的音频内容,来对显示的该音频对应的文本进行增删改等操作,使得语句通顺。
示例性的,参照图8e(1),用户点击显示界面601中文本603内的位置6,该位置6为文本603中第二句话的字符“友”和字符“在”之间的位置,录音机应用可获取到位置6在文本603中的坐标(x6,y6),然后,录音机应用可基于实施方式1的方案来确定位置6对应于原始记录数据中的语句的标识index(i),这里录音机应用确定的标识为index(1),即定位的语句为第二句话。
示例性的,如图8e(2)所示,用户在位置6添加了逗号“,”,使得文本603中的第二句话的字符数量加一。
S313,文本视图模块反向刷新原始记录数据。
示例性的,参照图8e,用户对文本的编辑操作(这里为单击操作),使得文本603中的第二句话的字符数量更新,那么文本视图模块可将录音1(或者说文本603)的原始记录数据中标识为index(1)的第二句话的L的取值加一,以反向刷新原始记录数据。
示例性的,在用户对文本的编辑操作,使得所编辑的文本对应的语句(标识为index(i))的字符数量,相比于原始记录数据中标识为index(i)的语句的字符数量li发生变化(增加或减少)时,则根据编辑后的标识为index(i)的语句的字符数量,来对原始记录数据中相应语句的字符数量li进行更新。
示例1，当用户在文本603中进行文本编辑的位置位于两句话之间时，例如用户点击图8e(1)所示的文本603中的文本“近日，”与文本“记者”之间的目标位置，并在目标位置增加一句话或一段话或至少一个字符时，则录音机应用在基于该目标位置，来确定用户单击位置在原始记录数据中所属的第几句话时，参照上文实施方式1的描述可知，录音机应用可确定的定点播放位置为第一句话。那么用户在目标位置所增加的文本的字符数量a可以追加到第一句话的l0，使得更新后的l0（例如l0'）=l0+a。
在本申请实施例中,在依据用户对文本的编辑操作,来对原始记录数据中的目标文本片段(例如本申请所定义的一句话)的字符数量L进行更新时,该编辑操作在两句话之间时,该编辑操作所增加的字符数量可追加至前一句话的字符数量L中,以提升用户体验。
反之，若将该编辑操作所增加的字符数量追加到后一句话的字符数量L中，由于录音中对应于该两句话（例如第一句话和第二句话）之间的位置可能不具有音频数据，那么在该编辑操作之后，用户若通过操作使录音机应用正向定点播放第二句话的音频片段，该音频片段中并不具有第二句话中上述编辑操作所新增的文本内容对应的音频数据，则会造成输出的音频片段的开始部分与第二句话的开始部分语义不匹配，从而影响用户对文本的编辑操作，用户体验较差。
示例2,当用户对文本的编辑操作为:在由音频转换成的文本的最开始位置(即第一句话之前)增加字符,则增加的字符的数量,可追加到该文本的原始记录数据中,第一句话的字符数量L中。例如用户在图8e(1)中的文本“近日,”之前增加了目标文本“本台记者报道,”,目标文本的字符数量为7,文本603的第一句话的字符数量l0为3,那么可按照目标文本的字符数量为7,对字符数量l0进行追加,使得更新后的字符数量l0=10。
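示例性的，以下给出S313中反向刷新原始记录数据的一段示意性Kotlin代码（函数名为假设的示例）。编辑位置所属语句的下标可按上文正向定位的方式（例如locateSentence）确定；依据上述示例1的处理方式，编辑位置位于两句话之间时，locateSentence中totalCount(i)大于或等于offsetCount的判定条件恰好将其归入前一句话。刷新时仅更新该语句的字符数量L，其时间戳T保持不变：

```kotlin
// 反向刷新：editIndex 为编辑位置所属语句的下标，delta 为该编辑增减的字符数（减少时为负）
fun refreshRecord(records: MutableList<SentenceRecord>, editIndex: Int, delta: Int) {
    val old = records[editIndex]
    records[editIndex] = old.copy(charCount = old.charCount + delta) // 时间戳 T 不变
}
```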
需要说明的是,虽然录音机应用在反向刷新原始记录数据时,可对相应语句的字符数量L进行更新,但是,该语句对应的时间戳T并没有发生变化。
示例性的，参照图8f(1)，用户在对文本进行编辑操作时，还可对需要编辑的文本进行选中（也可以单击某个字符），而非图8e所示的单击文本中的某个位置（该位置不存在字符）。如图8f(1)所示，用户可选中文本1（这里为文本“盛开风景如画”），录音机应用可响应于该用户选中文本（至少一个字符）的操作，在显示界面601显示控件604。示例性的，控件604可显示在用户选中的多句话的附近区域，本申请对于控件604在显示界面601中的显示位置不做限制。控件604可包括“复制”选项、“剪切”选项、“全选”选项、“翻译”选项、“播放”选项、“分享”选项。
可选地,录音机应用也可以只在用户对转换后的文本中选择至少两个字符(例如选中半句话、一句话、连续的多句话的任意场景)时,显示该控件604。并且,在触发对选中的至少两个字符所属的目标音频片段进行播放的操作时,可以是用户点击该控件604中的“播放”选项,而非单击播放暂停控件6025。示例性的,控件604中的“播放”选项在被触发时,可用于仅对选中的至少两个字符所属的目标音频片段进行播放,而不对该目标音频片段之外的音频片段进行播放。
示例性的,用户点击控件604中的“复制”选项,则录音机应用可对图8f(1)中选中的文本1进行复制操作。
示例性的，用户点击控件604中的“剪切”选项，则录音机应用可对图8f(1)的文本1进行剪切操作。该剪切操作引起文本1对应的第四句话的字符数量L发生变化，将文本1剪切之后，第四句话的字符数量l3更新为8。其中，编辑后的第四句话对应的文本为“感受白花的春天。”，其字符数量为8。
示例性的,用户点击控件604中的“全选”选项,则录音机应用可对图8f(1)的文本1进行全选操作。
示例性的,用户点击控件604中的“翻译”选项,则录音机应用可对图8f(1)的文本1进行翻译操作,例如中译英,英译中等,翻译的源语言与目标语言可预先配置,这里不做限制。
示例性的,用户点击控件604中的“播放”选项,则录音机应用可对图8f(1)的文本1,按照上述实施方式2的方法,来确定定点播放位置,例如文本1属于原始记录数据中的第几句话,这里为第四句话,然后,显示图8f(2)所示的显示界面601,以对第四句话进行定点播放。关于从图8f(1)变化为图8f(2)的正向定点播放的实现原理,与图8c所示的方案类似,这里不再赘述。
在一些实施例中,在上述图8b(1)、图8c(1)、图8d(1)中,用户选择文本603中的至少一个字符后,同样可触发显示图8f(1)所示的控件604,且控件604的使用方法与图8f实施例中介绍的相同,这里不再赘述。
在图7a示出的正向定点播放的场景中，本申请实施例的录音机应用可在录音转文本的同时，获取与存储原始记录数据。在结束录音转文本的操作后，用户通过点击转换后的文本中语义不通顺的语句，或者选择指定的几句话，来进行音频片段的正向定点播放，示例性的，正在播放的语句可以蓝色等预设显示方式来显式提示用户。用户可依据播放的音频内容，对文本进行增删改等编辑操作，该编辑操作引起相应语句的字符数量的变化时，录音机应用可刷新原始记录数据，保证后续的正向定点播放或反向定点播放的场景下，定点播放的音频片段和以预设显示方式显示的文本片段依然准确。
在本申请的实施例中，电子设备可通过用户对转换后的文本的操作位置（例如光标位置），基于记录的关于T与L的映射关系的原始记录数据，来定位一句话对应的音频片段以及文本片段，从而确定音频片段在完整的录音中的具体位置，以及文本片段在完整的文本中的具体位置，可实现正向定点播放。电子设备还可通过用户对转换后的文本选中的至少两个字符（半句话或一句话或连续的多句话），基于记录的关于T与L的映射关系的原始记录数据，来定位至少一个音频片段以及至少一个文本片段，从而确定至少一个音频片段在完整的录音中的具体位置，以及至少一个文本片段在完整的文本中的具体位置，可实现正向定点播放。
图7b为示例性示出的反向定点播放场景下,该录音机应用的数据处理过程的示意图,可结合图6b以及图8a、图8g的场景示意图来理解。
如图7b所示,该过程可包括如下步骤:
S301,音频控制模块获取录音的原始记录数据。
示例性的,本步骤的实现原理,与图7a中的S301相同,这里不再赘述。
S302,音频控制模块响应于对录音的播放进度条的定点播放操作,获取定点播放时间。
示例性的,在图8a之后,随着播放进度的增加,手机的显示界面切换为如图8g(1)所示的显示界面601,播放进度控件6024在播放进度条6023上位于进度p1对应的位置,进度p1为当前播放时间控件6021示出的当前播放时间,这里为0分5秒。
示例性的,如图8g(1)所示,用户沿箭头方向将播放进度控件6024从进度p1,拖动至图8g(2)所示的进度p2对应的位置,进度p2为图8g(2)中当前播放时间控件6021示出的当前播放时间,这里为0分21秒。那么该用户拖动播放进度控件6024的操作可产生对录音1的当前播放时间的变化,从而触发执行更新回调函数。录音机应用在执行更新回调函数时,可根据播放进度控件6024,在播放进度条6023上的当前位置(例如进度p2),获取当前播放时间progressTime(上述S302所述的定点播放时间的一个示例),这里为0分21秒。
S304,定点播放模块基于上述定点播放时间和原始记录数据进行定点计算,确定定点播放位置。
示例性的,可参照上述实施方式3的方法,录音机应用可基于该录音1的原始记录数据,来确定当前播放时间progressTime属于t2与t3之间,则可以确定index(i)=index(3),其中,index(3)为第四句话的标识,可以确定定点播放位置为第四句话。
S307,定点播放模块更新播放进度以及在文本中标识目标文本片段。
示例性的，结合上述实施方式3，在图8g的场景中，结合图6b，参照图8g(3)，标识index(3)用于标识文本603中第四句话，定点播放模块可从原始记录数据中获取第四句话的L与T，第四句话对应的L为l3，对应的T为t3，其中，t3为第四句话对应的音频片段的结束时间（例如结束播放时间），此外，如图6b所示的t2为第四句话对应的音频片段的起始时间（例如开始播放时间），这里的t2为当前播放时间控件6021示出的0分20秒。
示例性的，可参考手机的显示界面从图8g(2)变化为图8g(3)所示的显示界面，定点播放模块可将播放进度控件6024在播放进度条6023中所处的位置，从进度p2对应的位置，切换为进度p3对应的位置。示例性的，进度p3可为图8g(3)中当前播放时间控件6021示出的当前播放时间，这里为0分20秒。此外，定点播放模块还将当前播放时间控件6021示出的当前播放时间从0分21秒切换为0分20秒，从而将当前播放时间调整为定位的第四句话的开始播放时间t2。
示例性的,定点播放模块还可基于原始记录数据中记录的第一句话至第四句话中每句话的字符数量L,来确定文本603中对应于第四句话的目标文本片段,并将该目标文本片段以加粗斜体的方式显示在显示界面601,以区别于文本603中其他未被选中的文本的显示方式。这里的目标文本片段(例如第四句话)为“感受白花盛开风景如画的春天。”,以提醒用户待定点播放的文本内容。
在一些实施例中,定点播放模块在响应于用户的定点播放操作时,不仅可以更新播放进度,以及标识选中的目标文本片段,还可将录音的播放状态设置为暂停播放状态,如图8g(3)中的播放暂停控件6025示出的三角形的图标,该图标用于表示录音处于暂停播放状态。这样,用户可根据需要而灵活的选择播放目标文本片段的时机。
可选地,S309,音频控制模块接收播放操作。
示例性的,本步骤的实现原理,与图7a中的S309相同,这里不再赘述。
S310,定点播放模块根据实时播放的目标音频片段,在文本中标识对应的目标文本片段。
示例性的,本步骤的实现原理,与图7a中的S310相同,这里不再赘述。
可选地,S311,音频控制模块接收暂停播放操作。
示例性的,本步骤的实现原理,与图7a中的S311相同,这里不再赘述。
S312,文本视图模块编辑文本。
示例性的,本步骤的实现原理,与图7a中的S312相同,这里不再赘述。
S313,文本视图模块反向刷新原始记录数据。
示例性的,本步骤的实现原理,与图7a中的S313相同,这里不再赘述。
在图7b示出的反向定点播放的场景中，本申请实施例的录音机应用可在录音转文本的同时，获取与存储原始记录数据。在结束录音转文本的操作后，用户通过拖动播放进度条来反向定点播放，示例性的，正在播放的语句可以蓝色等预设显示方式来显式提示用户。用户可依据播放的音频内容，对文本进行增删改等编辑操作，在该编辑操作引起相应语句的字符数量的变化时，录音机应用可刷新原始记录数据，保证后续的正向定点播放或反向定点播放的场景下，定点播放的音频片段和以预设显示方式显示的文本片段依然准确。
需要说明的是,上述图5、图8a至图8g中相同的附图标记表示相同的对象,因此,未对各附图的附图标记做逐一解释说明,上述各附图中未提及的附图标记可参照上述图5、图8a至图8g中相同的已提及的附图标记的解释说明,这里不再赘述。
在本申请的上述实施例中，可在电子设备本地对录音转换为文本，并且，可在录音过程中实时地进行文本转换，以提升获取的原始记录数据中各句话的时间戳的准确性。在对录音转文本时，无需将录音上传云端，可提升文本转换效率。在用户检查到录音转换后的文本中存在不通顺的文本时，用户通过点击需要定位的文本或拖动播放进度条的方式，即可实现对文本和录音的定点播放，该定点播放的精准度较高，可提升对录音文件的文本的校对效率，以及编辑效率。
在一些实施例中,本申请的上述电子设备实现的技术方案还可应用于一个系统,该系统可包括通信连接的第一电子设备和第二电子设备,其中,第一电子设备具备音频录制功能。第二电子设备具备录音转文本功能、文本编辑功能以及定点播放功能。例如该系统可以应用到分布式麦克场景,其中,分布式麦克表示通信连接的至少两个电子设备的麦克风。
例如第一电子设备为录音笔，第二电子设备为手机、平板电脑或笔记本电脑等。以第二电子设备为平板电脑为例，录音笔可实时录制音频，并将实时录制的音频发送至平板电脑，平板电脑可对实时接收到的音频进行文本转换，以获取原始记录数据。在录音结束后，用户可操作平板电脑，来播放录音以显示该录音对应的文本内容。以及用户可对录音或文本进行操作，以实现正向定点播放和反向定点播放，以及对文本的校对等编辑操作。在该至少两个电子设备的应用场景下，本申请的技术方案的具体实现方式的原理与该方案应用于一个电子设备的实现方式类似，这里不再赘述。
那么由该系统实现本申请的技术方案，可由一个用户使用第一电子设备在现场录制音频，另一个用户使用第二电子设备同步接收到实时录制的音频以及转换后的文本，利于提升文本编辑效率和文本校对效率。
可以理解的是,电子设备为了实现上述功能,其包含了执行各个功能相应的硬件和/或软件模块。结合本文中所公开的实施例描述的各示例的算法步骤,本申请能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。本领域技术人员可以结合实施例对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
一个示例中，图9示出了本申请实施例的一种装置300的示意性框图。装置300可包括：处理器301和收发器/收发管脚302，可选地，还包括存储器303。
装置300的各个组件通过总线304耦合在一起,其中总线304除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图中将各种总线都称为总线304。
可选地，存储器303可以用于存储前述方法实施例中的指令。该处理器301可用于执行存储器303中的指令，并控制接收管脚接收信号，以及控制发送管脚发送信号。
装置300可以是上述方法实施例中的电子设备或电子设备的芯片。
其中,上述方法实施例涉及的各步骤的所有相关内容均可以援引到对应功能模块的功能描述,在此不再赘述。
本实施例还提供一种计算机存储介质，该计算机存储介质中存储有计算机指令，当该计算机指令在电子设备上运行时，使得电子设备执行上述相关方法步骤，以实现上述实施例中的数据处理方法。
本实施例还提供了一种计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述相关步骤,以实现上述实施例中的数据处理方法。
另外，本申请的实施例还提供一种装置，这个装置具体可以是芯片、组件或模块，该装置可包括相连的处理器和存储器；其中，存储器用于存储计算机执行指令，当装置运行时，处理器可执行存储器存储的计算机执行指令，以使芯片执行上述各方法实施例中的数据处理方法。
其中,本实施例提供的电子设备、计算机存储介质、计算机程序产品或芯片均用于执行上文所提供的对应的方法,因此,其所能达到的有益效果可参考上文所提供的对应的方法中的有益效果,此处不再赘述。
通过以上实施方式的描述,所属领域的技术人员可以了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个装置,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是一个物理单元或多个物理单元,即可以位于一个地方,或者也可以分布到多个不同地方。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
本申请各个实施例的任意内容,以及同一实施例的任意内容,均可以自由组合。对上述内容的任意组合均在本申请的范围之内。
集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该软件产品存储在一个存储介质中,包括若干指令用以使得一个设备(可以是单片机,芯片等)或处理器(processor)执行本申请各个实施例方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
结合本申请实施例公开内容所描述的方法或者算法的步骤可以硬件的方式来实现,也可以是由处理器执行软件指令的方式来实现。软件指令可以由相应的软件模块组成,软件模块可以被存放于随机存取存储器(Random Access Memory,RAM)、闪存、只读存储器(Read Only Memory,ROM)、可擦除可编程只读存储器(Erasable Programmable ROM,EPROM)、电可擦可编程只读存储器(Electrically EPROM,EEPROM)、寄存器、硬盘、移动硬盘、只读光盘(CD-ROM)或者本领域熟知的任何其它形式的存储介质中。一种示例性的存储介质耦合至处理器,从而使处理器能够从该存储介质读取信息,且可向该存储介质写入信息。当然,存储介质也可以是处理器的组成部分。处理器和存储介质可以位于ASIC中。另外,该ASIC可以位于网络设备中。当然,处理器和存储介质也可以作为分立组件存在于网络设备中。
本领域技术人员应该可以意识到,在上述一个或多个示例中,本申请实施例所描述的功能可以用硬件、软件、固件或它们的任意组合来实现。当使用软件实现时,可以将这些功能存储在计算机可读介质中或者作为计算机可读介质上的一个或多个指令或代码进行传输。计算机可读介质包括计算机存储介质和通信介质,其中通信介质包括便于从一个地方向另一个地方传送计算机程序的任何介质。存储介质可以是通用或专用计算机能够存取的任何可用介质。
上面结合附图对本申请的实施例进行了描述,但是本申请并不局限于上述的具体实施方式,上述的具体实施方式仅仅是示意性的,而不是限制性的,本领域的普通技术人员在本申请的启示下,在不脱离本申请宗旨和权利要求所保护的范围情况下,还可做出很多形式,均属于本申请的保护之内。

Claims (18)

  1. 一种数据处理方法,其特征在于,包括:
    响应于接收到的第一用户操作,在将音频数据转换为文本数据的过程中,获取第一信息;
    其中,所述音频数据为实时采集的音频数据,所述第一信息包括第一音频片段的第一时间戳与第一文本片段的第一字符数量的第一映射关系,其中,所述第一文本片段为所述第一音频片段的第一文本转换结果,所述音频数据包括至少一个所述第一音频片段,所述文本数据包括至少一个所述第一文本片段,所述第一时间戳为用于标识所述第一音频片段的起始时间点或结束时间点的时间戳;
    响应于接收到的第二用户操作,基于所述第一信息,将所述音频数据的播放进度更新至第二音频片段的第一起始时间点,以及将第二文本片段以预设显示方式显示;
    其中,所述第二文本片段为所述第二音频片段的第二文本转换结果;
    所述第二音频片段包括至少一个所述第一音频片段,所述第二文本片段包括至少一个所述第一文本片段。
  2. 根据权利要求1所述的方法,其特征在于,所述第一文本片段中的最后一个字符为预设标点符号,其中,所述预设标点符号为语义表示断句的标点符号。
  3. 根据权利要求2所述的方法,其特征在于,所述在将音频数据转换为文本数据的过程中,获取第一信息,包括:
    在将所述音频数据转换为所述文本数据的过程中,检测到所述音频数据的第三文本转换结果的类型为中间结果,基于所述第三文本转换结果中的所述预设标点符号,获取第二时间戳;
    基于所述第二时间戳,记录或更新与所述中间结果对应的,排列次序与时间戳的第一对应关系;
    其中,所述第二时间戳用于标识所述第三文本转换结果的生成时间,所述排列次序用于表示所述中间结果中所述预设标点符号的排列次序;
    在将所述音频数据转换为所述文本数据的过程中,检测到所述音频数据的第四文本转换结果的类型为最终结果,基于第二对应关系和所述第四文本转换结果,获取所述第一映射关系;
    其中，所述第二对应关系为与最近一次检测到的中间结果对应的所述第一对应关系。
  4. 根据权利要求3所述的方法,其特征在于,所述基于所述第三文本转换结果中的所述预设标点符号,获取第二时间戳,包括:
    检测到所述第三文本转换结果包括所述预设标点符号,且所述第三文本转换结果为首个中间结果,获取所述第二时间戳;或,
    检测到所述第三文本转换结果中所述预设标点符号的第一数量,大于上一次的第三文本转换结果中所述预设标点符号的第二数量,获取所述第二时间戳;或,
    检测到所述第一数量小于所述第二数量,获取所述第二时间戳。
  5. 根据权利要求3或4所述的方法,其特征在于,所述基于所述第二时间戳,记录或更新与所述中间结果对应的,排列次序与时间戳的第一对应关系,包括:
    在与所述中间结果对应的所述第一对应关系中,记录或增加一条最后一个排列次序与所述第二时间戳的对应关系;或,
    在与所述中间结果对应的所述第一对应关系中,将最后一个排列次序与时间戳的对应关系删除,以更新所述第一对应关系,并将更新后的所述第一对应关系中,与当前最后一个排列次序对应的时间戳更新为所述第二时间戳。
  6. 根据权利要求1至5中任意一项所述的方法,其特征在于,所述基于第二对应关系和所述第四文本转换结果,获取所述第一映射关系,包括:
    基于所述第二对应关系中的排列次序,确定所述第四文本转换结果中所述至少一个第一文本片段各自的所述第一字符数量;
    基于所述第二对应关系中相互对应的排列次序与时间戳,确定与所述第四文本转换结果对应的音频数据中,所述至少一个第一音频片段各自的所述第一时间戳;
    基于所述第二对应关系中的排列次序,获取所述第一时间戳与所述第一字符数量的第一映射关系,其中,排列次序相同的所述第一时间戳与所述第一字符数量相互映射。
  7. 根据权利要求1至6中任意一项所述的方法,其特征在于,所述响应于接收到的第二用户操作,基于所述第一信息,将所述音频数据的播放进度更新至第二音频片段的第一起始时间点,以及将第二文本片段以预设显示方式显示,包括:
    响应于接收到的所述第二用户操作,确定所述第一信息中的至少一个第一映射关系;
    基于所述至少一个第一映射关系和所述音频数据,确定至少一个第三音频片段,其中,所述至少一个第三音频片段中所述第一时间戳最早的第三音频片段为所述第二音频片段;
    基于所述至少一个第一映射关系和所述文本数据,确定至少一个第三文本片段,所述第二文本片段包括所述至少一个第三文本片段;
    基于所述第一信息,将所述音频数据的播放进度更新至所述第二音频片段的第一起始时间点;
    将所述文本数据中的所述第二文本片段以预设显示方式显示。
  8. 根据权利要求7所述的方法,其特征在于,所述第二用户操作包括对所述文本数据的第一操作,所述第一操作包括至少一个点击位置,所述响应于接收到的所述第二用户操作,确定所述第一信息中的至少一个第一映射关系,包括:
    响应于接收到的对所述文本数据的所述第一操作，基于所述第一信息和所述至少一个点击位置，确定所述文本数据中，分别位于所述至少一个点击位置之前的字符的至少一个第二字符数量；
    基于所述第一信息和所述至少一个第二字符数量,确定所述第一信息中的至少一个第一映射关系。
  9. 根据权利要求8所述的方法,其特征在于,所述第一操作仅包括一个点击位置,所述至少一个第一映射关系的数量为一个。
  10. 根据权利要求8所述的方法,其特征在于,所述第一操作包括两个点击位置,所述第一操作用于选中至少两个字符,在所述至少一个第三音频片段的数量为多个时,多个所述第三音频片段为所述音频数据中播放时间连续的音频片段。
  11. 根据权利要求1至10中任意一项所述的方法,其特征在于,所述将所述音频数据的播放进度更新至第二音频片段的第一起始时间点,以及将第二文本片段以预设显示方式显示之后,所述方法还包括:
    响应于接收到的第三用户操作,从所述第二音频片段的第一起始时间点开始,按照所述至少一个第三音频片段各自的第一时间戳从早到晚的顺序,依次播放所述至少一个第三音频片段。
  12. 根据权利要求1至9或11中任意一项所述的方法,其特征在于,所述第一操作仅包括一个点击位置,所述至少一个第三音频片段的数量为一个,所述至少一个第三文本片段的数量为一个,所述第二音频片段与所述第三音频片段相同;
    所述依次播放所述至少一个第三音频片段之后,所述方法还包括:
    在播放至所述第三音频片段的第一结束时间点时,基于所述第一信息,继续播放第二起始时间点为所述第一结束时间点的第四音频片段;
    在播放至所述第三音频片段的第一结束时间点时,将所述第三文本片段的显示方式恢复为原显示方式,以及基于所述第一信息,将与所述第四音频片段对应的第四文本片段的显示方式从所述原显示方式更新为所述预设显示方式。
  13. 根据权利要求1至8，或10或11中任意一项所述的方法，其特征在于，所述第一操作用于选中至少两个字符；
    所述依次播放所述至少一个第三音频片段之后,所述方法还包括:
    在播放至所述至少一个第三音频片段中第一时间戳最晚的第三音频片段的第二结束时间点时,暂停播放所述至少一个第三音频片段,以及将所述至少一个第三文本片段的显示方式恢复为所述原显示方式。
  14. 根据权利要求7所述的方法，其特征在于，所述第二用户操作包括对所述音频数据的播放进度的调整操作，所述调整操作包括所述音频数据的播放进度时间，所述响应于接收到的所述第二用户操作，确定所述第一信息中的至少一个第一映射关系，包括：
    响应于接收到的对所述音频数据的播放进度的调整操作,基于所述第一信息和所述播放进度时间,确定所述音频数据中的一个第一映射关系;
    其中,所述至少一个第三音频片段的数量为一个,所述第三音频片段对应的时间范围内包括所述播放进度时间;
    其中,所述时间范围为所述第三音频片段的第三起始时间点和第三结束时间点构成的时间范围。
  15. 一种电子设备,其特征在于,包括:存储器和处理器,所述存储器和所述处理器耦合;所述存储器存储有程序指令,所述程序指令由所述处理器执行时,使得所述电子设备执行如权利要求1至14中任意一项所述的数据处理方法。
  16. 一种计算机可读存储介质,其特征在于,包括计算机程序,当所述计算机程序在电子设备上运行时,使得所述电子设备执行如权利要求1至14中任意一项所述的数据处理方法。
  17. 一种包含指令的计算机程序产品,其特征在于,当所述计算机程序产品在计算机上运行时,使得所述计算机执行如权利要求1至14中任意一项所述的数据处理方法。
  18. 一种芯片,其特征在于,包括一个或多个接口电路和一个或多个处理器;所述接口电路用于从电子设备的存储器接收信号,并向所述处理器发送所述信号,所述信号包括存储器中存储的计算机指令;当所述处理器执行所述计算机指令时,使得所述电子设备执行权利要求1至14中任意一项所述的数据处理方法。