CN113672086A - Page processing method, device, equipment and medium

Info

Publication number
CN113672086A
Authority
CN
China
Prior art keywords
image
description information
page
target
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110896067.8A
Other languages
Chinese (zh)
Inventor
Tian Ye (田野)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110896067.8A
Publication of CN113672086A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61F - FILTERS IMPLANTABLE INTO BLOOD VESSELS; PROSTHESES; DEVICES PROVIDING PATENCY TO, OR PREVENTING COLLAPSING OF, TUBULAR STRUCTURES OF THE BODY, e.g. STENTS; ORTHOPAEDIC, NURSING OR CONTRACEPTIVE DEVICES; FOMENTATION; TREATMENT OR PROTECTION OF EYES OR EARS; BANDAGES, DRESSINGS OR ABSORBENT PADS; FIRST-AID KITS
    • A61F 4/00 - Methods or devices enabling patients or disabled persons to operate an apparatus or a device not forming part of the body

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Heart & Thoracic Surgery (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Physics & Mathematics (AREA)
  • Vascular Medicine (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Veterinary Medicine (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiments of the present application provide a page processing method, apparatus, device, and medium. When a displayed target page contains an image, the image can be processed, using image pattern recognition technology, into templated content description text that describes the semantics of the image, so that the content contained in the image and the meaning it expresses can be conveyed more accurately, and the complete page information of the target page can be delivered in combination with the barrier-free (accessible) screen reading capability. With the method and device, images (non-text content) in the target page can be effectively read aloud, improving the completeness of page information reading.

Description

Page processing method, device, equipment and medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for processing a page.
Background
The massive amount of information on the Internet is usually presented visually, and it is difficult for visually impaired people to browse it the way sighted people do, so they often cannot use the Internet for work or communication.
At present, some smart devices (or applications deployed on them) provide a barrier-free screen reading function to help visually impaired people read the information in a page: the device reads the information in the page and then announces it by voice broadcast, so that visually impaired users can receive the information through hearing. However, practice shows that the types of information the conventional barrier-free screen reading function supports are limited; usually only the text content in a page can be read aloud, while non-text content often cannot be read at all or is read incorrectly, so part of the information in the page is lost and more complete page information cannot be provided to visually impaired users.
Disclosure of Invention
The embodiments of the present application provide a page processing method, apparatus, device, and medium, which can effectively read aloud images (non-text content) in a target page and improve the completeness of page information reading.
On one hand, an embodiment of the present application provides a page processing method, including:
displaying a target page;
if the target page contains an image, acquiring semantic description information of the image; and
playing prompt audio matched with the semantic description information of the image.
On the other hand, an embodiment of the present application provides a page processing apparatus, including:
a display unit for displaying a target page;
the processing unit is used for acquiring semantic description information of the image if the target page contains the image;
and the processing unit is also used for playing prompt audio matched with the semantic description information of the image.
In one implementation, semantic description information of an image is used to semantically describe the image;
the semantic description information includes at least one of: description information of a target color presented by the image, description information of text content contained in the image, description information of a source of the image, description information of an author of the image, description information of an object contained in the image, and description information of a behavior executed by the object contained in the image;
the cue audio is for alerting at least one of: the target color is used for presenting the prompt image, the text content contained in the prompt image, the source of the prompt image, the author of the prompt image, the object contained in the prompt image and the behavior executed by the object contained in the prompt image.
In one implementation, the processing unit is further configured to start a screen reading mode;
the processing unit is configured to, when playing a cue audio matched with the semantic description information of the image, specifically: and in the screen reading mode, playing prompt audio matched with the semantic description information of the image.
In one implementation, the target page further includes other content, and the other content includes at least one of the following: text, rich text, and icons; the processing unit is further configured to:
and in the screen reading mode, sequentially reading the audio matched with each content according to the arrangement sequence of each content in the target page.
In one implementation, the target page further includes operation information, and the operation information includes at least one of the following: information of the operator, the operated object, the type of the operated object, information of the operation item, the feedback presented when the operation item is selected, and the change caused to the target page by the operation; the processing unit is further configured to:
and in the screen reading mode, converting the operation information in the target page into operation audio for playing and outputting.
In one implementation, the target page refers to any service page in a target application; the target application supports a screen reading mode and provides an entry to the screen reading mode; when starting the screen reading mode, the processing unit is specifically configured to:
start the screen reading mode when the entry to the screen reading mode is triggered;
wherein the entry to the screen reading mode comprises any one of: a key, an icon, a menu item, or a voice password.
In one implementation, if the target page is a first page, the image is a native image in the first page and does not support editing; the processing unit, when acquiring semantic description information of an image, is specifically configured to:
in the process of loading the first page, semantic description information of the image is acquired;
wherein the first page comprises any one of: the system comprises a webpage, a service page of an application program, a page of an applet program and a multimedia playing page.
In one implementation, if the target page is a second page, the image is added to the second page by an editing operation; the processing unit, when acquiring semantic description information of an image, is specifically configured to:
when an image is added to the second page, semantic description information of the image is acquired;
wherein the second page includes any one of: document editing pages, online document editing pages, social session pages.
In one implementation, the semantic description information includes description information of a target color presented by the image; when obtaining the semantic description information of the image, the processing unit is specifically configured to:
identifying S colors from the image, wherein S is an integer larger than 1;
acquiring the pixel number and the saturation of each color in the S colors;
multiplying the number of pixels of each color by its saturation to obtain a color score for each color;
determining the color with the maximum color score among the S colors as the target color of the image;
generating description information for describing the target color. A sketch of this scoring step is given below.
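A minimal sketch of this scoring step, assuming the image is available as canvas ImageData and that similar shades are bucketed into S representative colors; the quantization scheme and helper names are assumptions layered on top of the claimed score of pixel count multiplied by saturation.

```typescript
interface ColorStat { pixels: number; saturation: number; }

// HSL saturation in [0, 1]; gray shades return 0.
function rgbToSaturation(r: number, g: number, b: number): number {
  const max = Math.max(r, g, b) / 255;
  const min = Math.min(r, g, b) / 255;
  if (max === min) return 0; // achromatic
  const l = (max + min) / 2;
  return l > 0.5 ? (max - min) / (2 - max - min) : (max - min) / (max + min);
}

function targetColor(img: ImageData): string {
  const stats = new Map<string, ColorStat>();
  const d = img.data; // RGBA byte array
  for (let i = 0; i < d.length; i += 4) {
    // Quantize each channel to 8 levels so similar shades fall into one of S buckets
    const key = `${d[i] & 0xe0},${d[i + 1] & 0xe0},${d[i + 2] & 0xe0}`;
    let s = stats.get(key);
    if (!s) {
      // The saturation of the first pixel seen stands in for the whole bucket
      s = { pixels: 0, saturation: rgbToSaturation(d[i], d[i + 1], d[i + 2]) };
      stats.set(key, s);
    }
    s.pixels += 1;
  }
  let best = '';
  let bestScore = -1;
  for (const [key, { pixels, saturation }] of stats) {
    const score = pixels * saturation; // the claimed score: pixel count x saturation
    if (score > bestScore) { bestScore = score; best = key; }
  }
  return best; // e.g. "224,224,224", described in words downstream
}
```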
In one implementation, the semantic description information includes description information of text content contained in the image; the processing unit, when obtaining semantic description information of an image, is specifically configured to:
preprocessing the image;
carrying out image feature extraction on the preprocessed image to obtain image features;
classifying the image characteristics by adopting a classifier so as to identify the text content contained in the image;
and generating description information for describing the text content contained in the image. An illustrative sketch of this pipeline is given below.
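A hedged outline of the claimed preprocess, feature extraction, and classification pipeline; every function below is a hypothetical placeholder for a concrete OCR component rather than an actual API.

```typescript
// Placeholders: preprocess (e.g. grayscale, binarize, deskew), feature extraction,
// and a classifier that maps the extracted features to characters.
declare function preprocess(img: ImageData): ImageData;
declare function extractFeatures(img: ImageData): Float32Array;
declare function classify(features: Float32Array): string;

function describeTextContent(img: ImageData): string {
  const cleaned = preprocess(img);
  const features = extractFeatures(cleaned);
  const text = classify(features);
  return `The image has written characters: ${text}`;
}
```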
In one implementation, the semantic description information includes description information of a source of the image and description information of an author of the image; the processing unit, when obtaining semantic description information of an image, is specifically configured to:
acquiring a source of an image;
if the source of the image indicates that the image is from the local space, reading an author of the image from the local space;
if the source of the image indicates that the image is from the network file, acquiring a link of the image, and reading an author of the image according to the link;
generating description information of the source of the image and of the author of the image. An illustrative sketch of this branch is given below.
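An illustrative sketch of the source and author lookup branch; the metadata shape and both reader functions are hypothetical stand-ins for whatever storage and link-resolution mechanisms the implementation actually uses.

```typescript
interface ImageMeta { source: 'local' | 'network'; author: string; }

declare function readLocalAuthor(path: string): Promise<string>;   // e.g. local file metadata
declare function readAuthorFromLink(url: string): Promise<string>; // e.g. linked resource

async function imageMeta(img: { localPath?: string; url?: string }): Promise<ImageMeta> {
  if (img.localPath !== undefined) {
    // Source indicates local space: read the author from local storage
    return { source: 'local', author: await readLocalAuthor(img.localPath) };
  }
  // Source indicates a network file: obtain the image link and read the author through it
  return { source: 'network', author: await readAuthorFromLink(img.url!) };
}
```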
In one implementation, the semantic description information includes description information of an object contained in the image and description information of a behavior performed by the object contained in the image; when obtaining the semantic description information of the image, the processing unit is specifically configured to:
calling a visual vocabulary model to perform object recognition on the image, identifying the object contained in the image and a behavior sentence pattern associated with the object;
generating description information of the object contained in the image, and generating description information of the behavior performed by the object according to the object and its associated behavior sentence pattern.
In one implementation, the processing unit, when playing the cue audio matched with the semantic description information of the image, is specifically configured to:
creating a hidden document object node for the target page, and setting an auxiliary attribute for the hidden document object node;
processing semantic description information of the image into template texts in a templated form;
writing the template text into the hidden document object node;
when the write operation is detected, matching prompt audio for the written template text; and
playing the prompt audio. A web-platform sketch of this mechanism is given below.
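The hidden-node mechanism above maps naturally onto standard web APIs: an aria-live attribute can serve as the auxiliary attribute, and a MutationObserver can monitor the write operation. The sketch below is an interpretation under those assumptions, not necessarily the patented implementation; with a real screen reader the aria-live attribute alone would trigger the announcement, and the explicit speech-synthesis call just makes the monitor-then-play step visible.

```typescript
function createAnnouncer(page: HTMLElement): (templateText: string) => void {
  // Create a hidden document object node and set its auxiliary (accessibility) attribute
  const node = document.createElement('div');
  node.setAttribute('aria-live', 'polite'); // screen readers watch writes to this node
  node.style.position = 'absolute';
  node.style.clip = 'rect(0 0 0 0)';        // visually hidden, still exposed to screen readers
  page.appendChild(node);

  // Monitor the write operation; on each write, play audio matched to the written text
  new MutationObserver(() => {
    const utterance = new SpeechSynthesisUtterance(node.textContent ?? '');
    speechSynthesis.speak(utterance);
  }).observe(node, { childList: true, characterData: true, subtree: true });

  // Writing templated text into the node triggers the observer above
  return (templateText: string) => { node.textContent = templateText; };
}

// Usage: process the semantic description information into templated text and write it
const announce = createAnnouncer(document.body);
announce('The image has written characters: A pet I raised; the background color is light gray');
```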
In one implementation, the target page is rendered in a Canvas manner; the processing unit is further configured to:
creating a hidden document object node for the target page, and setting an auxiliary attribute for the hidden document object node;
writing the content in the Canvas node into a hidden document object node;
when the write operation is detected, matching content audio for the written content; and playing the content audio.
In one implementation, the processing unit is further configured to:
monitoring an operation event on a Canvas node and a feedback result of the operation event;
writing the operation event and the feedback result into a hidden document object node;
when the write operation is detected, matching operation audio for the written operation event and feedback result; and playing the operation audio. A combined sketch of the Canvas mirroring and event listening is given below.
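Under the same assumptions, a combined sketch of mirroring Canvas content and Canvas operation events into the hidden node; announce is the hypothetical helper from the previous sketch, and the content callback must be supplied by the renderer, since Canvas itself exposes no DOM text. The feedback wording is assumed.

```typescript
function mirrorCanvas(
  canvas: HTMLCanvasElement,
  canvasContent: () => string,
  announce: (text: string) => void,
): void {
  // Write the content currently drawn in the Canvas into the hidden document object node
  announce(canvasContent());

  // Listen for operation events on the Canvas node and write event plus feedback result
  canvas.addEventListener('click', (ev: MouseEvent) => {
    const feedback = `Selected position (${ev.offsetX}, ${ev.offsetY})`; // assumed format
    announce(feedback);
  });
  canvas.addEventListener('keydown', (ev: KeyboardEvent) => {
    announce(`Key ${ev.key} pressed on canvas`);
  });
}
```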
In another aspect, the present application provides a computer device, the device comprising:
a processor for loading and executing a computer program;
a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the above-described page processing method.
In one aspect, the present application provides a computer-readable storage medium storing a computer program adapted to be loaded by a processor and to execute the above-mentioned page processing method.
In one aspect, the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the page processing method.
In the embodiment of the application, for the displayed target page, if the target page contains an image, the semantic description information of the image can be acquired, and then the prompt audio matched with the semantic description information of the image is played. In the scheme, the semantic description information can describe the content expressed by the image semantically, and the semantic description information is converted into the matched prompt audio for broadcasting, so that the image (non-text content) in the target page can be effectively read, and the reading integrity of the page information is improved; in addition, the content expressed by the image can be accurately and completely expressed by playing the prompt audio, so that the semantic meaning of the image is assisted to be understood, richer page information can be provided in the process of reading the screen of the target page, and the intelligence of page screen reading processing is improved.
Drawings
To illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1a is a diagram illustrating a page processing scenario provided by an exemplary embodiment of the present application;
FIG. 1b is a diagram illustrating a destination page provided by an exemplary embodiment of the present application;
FIG. 2 is a flowchart illustrating a page processing method according to an exemplary embodiment of the present application;
FIG. 3 is a diagram illustrating a target page containing images and other content provided by an exemplary embodiment of the present application;
FIG. 4a is a schematic diagram illustrating a flowchart for initiating a screen reading mode according to an exemplary embodiment of the present application;
FIG. 4b is a diagram illustrating a confirmation window provided by an exemplary embodiment of the present application;
FIG. 5a is a schematic diagram illustrating an audio indication provided by an exemplary embodiment of the present application;
FIG. 5b is a schematic diagram illustrating an audio indication provided by an exemplary embodiment of the present application;
FIG. 5c is a schematic diagram illustrating an audio indication provided by an exemplary embodiment of the present application;
FIG. 6 is a flowchart illustrating a page processing method according to an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram illustrating an HSL color scheme provided by an exemplary embodiment of the present application;
FIG. 8 is a diagram illustrating a destination page provided by an exemplary embodiment of the present application;
FIG. 9 is a flow chart illustrating a process for obtaining an image containing text content according to an exemplary embodiment of the present application;
FIG. 10 is a flow chart illustrating a method for obtaining information describing a source and an author of an image according to an exemplary embodiment of the present application;
FIG. 11a illustrates a schematic diagram of a visual vocabulary model provided by an exemplary embodiment of the present application;
FIG. 11b illustrates a schematic diagram of a visual vocabulary model provided by an exemplary embodiment of the present application;
FIG. 12 is a flowchart illustrating listening to the content of a Canvas node according to an exemplary embodiment of the present application;
FIG. 13 is a schematic structural diagram of a page processing apparatus according to an exemplary embodiment of the present application;
FIG. 14 shows a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiments of the present application relate to barrier-free screen reading, a screen reading mode in which the content contained in a page is announced by voice (or audio). The barrier-free screen reading process may include: recognizing the content contained in the page, and outputting, by voice broadcast, speech matched with the recognized content, where the speech can be used to describe the semantics of the recognized content. Most information on the Internet is presented visually, so the barrier-free screen reading function is of great significance to visually impaired people. For example: in a scenario where a visually impaired person browses a web page with a browser, the web page content can be read based on the barrier-free screen reading function and played by voice broadcast, helping the visually impaired person conveniently obtain the content contained in the web page; one of the languages used to construct web pages is HTML5 (HyperText Markup Language 5). As another example: in a scenario where visually impaired people collaborate through online documents, the document content in the online documents can be read based on the barrier-free screen reading function and announced by voice, making it convenient for visually impaired people to work and communicate through the Internet; an online document is a document tool that multiple people can edit, view, and collaborate on online.
Barrier-free screen reading involves speech processing technology in the field of Artificial Intelligence (AI); specifically, the page content recognized by the barrier-free screen reading function can be output as speech through speech processing technology. Artificial intelligence is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive discipline covering both hardware-level and software-level techniques. The basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning. Among the key technologies of speech processing (Speech Technology) are automatic speech recognition (ASR), speech synthesis (Text-To-Speech, TTS), and voiceprint recognition. Making computers able to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is one of the most promising human-computer interaction modes of the future.
Specifically, barrier-free screen reading can be implemented by a target application (or simply, application) with the barrier-free screen reading function, where an application refers to a computer program for performing one or more specific tasks; for example, the target application may include a document application with the barrier-free screen reading function that can be used to open an online document, such as the Tencent Docs application. Depending on how it runs, the target application may include, but is not limited to: (1) an application installed and run in a terminal, where terminals may include, but are not limited to, personal computers (PCs), personal digital assistants (PDAs), mobile phones, wearable devices, smart vehicles, and other smart devices; (2) an installation-free application, i.e., one that can be used without downloading and installing, commonly called an applet, which usually runs inside a client as a subprogram; (3) a web application opened by a browser; and so on.
The embodiments of the present application provide a page processing scheme that not only supports recognizing the text content in a target page (such as the text content contained in DOM (Document Object Model) nodes of any service page) and outputting it as a voice prompt, but also supports, when the target page contains an image, obtaining the semantic description information of the image and playing the prompt audio matched with it. The semantic description information can describe the content expressed by the image semantically, and converting it into matching prompt audio for playing means images in the target page can be effectively read aloud, improving the completeness of page information reading. In addition, playing the prompt audio can accurately and completely express the content of the image, helping users understand its semantics, providing richer page information during screen reading of the target page, and improving the intelligence of page screen-reading processing.
The page processing scheme proposed by the embodiments of the present application can be executed by a computer device, or by a target application (such as a document application) running in the computer device. The computer device may be any smart device with the barrier-free screen reading function, including but not limited to: smart phones (e.g., Android phones, iOS phones), tablet computers, personal computers, portable personal computers, Mobile Internet Devices (MIDs), smart TVs, vehicle-mounted devices, head-mounted devices, and other smart devices capable of touch-screen interaction. An exemplary page processing scenario is illustrated in FIG. 1a. As shown in FIG. 1a, suppose a target page, which may belong to a document application, is displayed in a computer device 101; if the target page contains an image, the visual semantics of the image can be read to obtain the semantic description information of the image, and a voice prompt matched with the semantic description information is then output. Of course, the embodiment of the present application may further include a computer device 102, where the computer device 102 is a back-end server of the document application or of the computer device 101 and can provide service support for them; in this implementation, the operation of reading the visual semantics of the image mentioned in this embodiment may be executed by the computer device 102, which sends the recognized semantic description information of the image to the computer device 101 or to the document application running in it, so that the computer device 101 converts the semantic description information into matching prompt audio for playing. The content of the audio prompt 1011 played in FIG. 1a is: "The picture has written characters: A pet I raised; the background color of the picture is light gray; there is a gray cat on the picture; the cat is licking its paw; the picture is from XX; the author of the picture is XX". The embodiments of the present application do not limit the execution subject, which is noted here. It should also be noted that the above-mentioned images may be image frames contained in a video stream; in an actual application scenario, when a visually impaired person selects a certain image frame while playing a video, the selected image frame can be processed by the page processing method provided by the embodiments of the present application, and prompt audio matched with the semantic description information of that image frame is played.
Practice shows that the page processing scheme provided by the embodiments of the present application has obvious advantages for barrier-free screen reading of a target page. The advantages are described below by comparing this scheme with the existing barrier-free screen reading function. The existing mainstream barrier-free screen reading function can only recognize and announce by voice the text content in the target page; when the target page contains content such as images, it can only announce the format of that content in simple terms. As shown in the first diagram of FIG. 1b, the target page is a slide page containing characters and an image, so when the existing mainstream barrier-free screen reading function reads the slide page, the output prompt audio is "A pet I raised, [image]"; if the text "A pet I raised" belongs to the image itself, the output prompt audio is even simpler, namely "[image]". A visually impaired user thus cannot acquire the specific content of the image and only knows that the slide page contains one, so they cannot obtain the complete information of the page. When the page processing scheme provided by the embodiments of the present application performs barrier-free screen reading on the slide page shown in FIG. 1b, not only can the characters "A pet I raised" in the slide page be read aloud, but the objects in the image, the behaviors of the objects, and other information can also be acquired and announced by voice; as shown in the second diagram of FIG. 1b, the prompt audio played by the embodiment of the present application is "The image has written characters: A pet I raised; the background color of the image is light gray; there is a gray cat on the image; the cat is licking its paw; the image is from XX; the image author is XX ……". Based on the method and device, the semantic description information of the image in the target page can be obtained and converted into matching prompt audio, and playing the prompt audio can accurately and completely express the content of the image, thereby helping users understand the semantics of the image and improving the intelligence of page screen-reading processing.
Based on the above-described page processing scheme, a more detailed page processing method is provided in the embodiments of the present application, and the page processing method provided in the embodiments of the present application will be described in detail below with reference to the accompanying drawings.
FIG. 2 is a flowchart illustrating a page processing method according to an exemplary embodiment of the present application; the page processing method may be performed by a computer device (such as the computer device 101 mentioned above), and may include, but is not limited to, steps S201-S203:
S201: Display the target page.
The target page may be displayed in a display screen of a computer device (e.g., a smart device) when the target user (e.g., a user using the computer device 101) opens and uses the computer device. Alternatively, the target page may be an operating system page of the computer device, such as a configuration page for configuring functions of the computer device. Optionally, the target page may also refer to any service page in a target application program running in a computer device (such as an intelligent terminal); for example: the target application is a document application running in the computer device and having a document editing function, and any service page in the target application may include a document page of any document included in the document application, where any document may be an online document (such as an online table, an online text, an online slide, and the like), that is, a document that can be edited by multiple persons in a collaborative manner. For convenience of description, the following description will take an example of a document page in which a document application runs in a computer device, and a target page is any document included in the document application.
The target page may include images and other content, and the other content may include at least one of: text, rich text, icons, video, audio, and so on. Text here means plain text, a representation of written language; a text may include a plurality of characters, one or more of which form a character string, and the characters may include at least one of: Chinese characters, English characters (i.e., letters), numerals, and punctuation (e.g., commas, periods, brackets). For example, the target page contains the text "A pet I raised", which consists of multiple characters such as "A", "pet", and so on. Rich text, or rich text format, is the counterpart of plain text: a rich text can carry a variety of formats, such as font colors, pictures, tables, animations, and emoticons. Icons may include, but are not limited to, functional icons and non-functional icons; a functional icon can be triggered to implement a certain function, such as an insert icon that, when clicked, inserts content into the target page; non-functional icons are icons that cannot be triggered and serve only as a reminder or for viewing.
It should be noted that, the above is only an exemplary description of content that may be included in the target page, and in an actual application scenario, the target page may further include content that is not mentioned above, and the embodiment of the present application does not limit the number and the type of the content included in the target page.
S202: If the target page contains an image, acquire semantic description information of the image.
The semantic description information of the image is used to describe the image semantically; it may include at least one of: description information of the target color presented by the image, description information of the text content contained in the image, description information of the source of the image, description information of the author of the image, description information of the object contained in the image, and description information of the behavior performed by the object contained in the image. The object contained in the image refers to a category individual contained in the image, such as a cat, a person, an umbrella, or a sofa; the behavior performed by the object contained in the image refers to the action the object performs. As shown in FIG. 3, the target page 301 includes an image 3011; the target color presented by the image 3011 is gray (for example, if the target color of an image refers to its background color, then the background color of this image is gray); the text content contained in the image is "A pet I raised"; the image comes from XX; the author of the image is XX; the object contained in the image is a cat; and the behavior performed by the object, the cat, is "licking its paw".
It can be understood that when the target page further includes other content, the embodiment of the present application also supports obtaining the description information of that other content while obtaining the semantic description information of the image; for example, if the target page also contains text, the description information of the text can be acquired. This facilitates the subsequent voice broadcast of all content contained in the target page. Continuing with FIG. 3, if the target page 301 includes a text 3012 in addition to the image 3011, the text content of the text 3012 can also be recognized to obtain its description information, which is in fact the text content itself; if the text content of the text 3012 is "Title: XXXX", the recognized description information of the text 3012 is "Title: XXXX". It should be noted that the process of acquiring the semantic description information of the image shown in step S202 is a process of performing pattern recognition on the image; pattern recognition classifies samples into categories according to their features by computational methods, and in the embodiment of the present application the samples are the images mentioned herein.
S203: Play prompt audio matched with the semantic description information of the image.
The prompt audio matched with the semantic description information of the image is played by voice broadcast, so that visually impaired users can understand, through hearing, the semantics the image is meant to express. The cue audio matched with (or corresponding to) the semantic description information can be used to prompt at least one of the following: the target color presented by the image (when the semantic description information includes description information of the target color presented by the image); the text content contained in the image (when it includes description information of the text content contained in the image); the source of the image (when it includes description information of the source of the image); the author of the image (when it includes description information of the author of the image); the object contained in the image (when it includes description information of the object contained in the image); and the behavior performed by the object contained in the image (when it includes description information of that behavior).
It can be understood that steps S201 to S203 above are implemented after the screen reading mode is started; in other words, the barrier-free screen reading function has a switch, and the function is enabled only when the screen reading mode is started, so that users without visual impairment need not start it. In a specific implementation, the embodiment of the present application supports starting the screen reading mode, so that playing the prompt audio matched with the semantic description information of the image in step S203 includes: playing the prompt audio matched with the semantic description information of the image in the screen reading mode. When the target page refers to any service page in a target application, the target application supports the screen reading mode; when the target page refers to any service page of a smart terminal, the smart terminal supports the screen reading mode; the embodiment of the present application is described by taking a target application supporting the screen reading mode as an example. The target application provides an entry to the screen reading mode, and when the entry is triggered, the screen reading mode of the target application is started; the entry may include any of the following: a key, an icon, a menu item, or a voice password. Several implementations of starting the screen reading mode are briefly described below:
(1) Starting the screen reading mode through a menu item. In a specific implementation, a menu control (or component, option, etc.) is displayed in the target page; when the menu control is triggered, an option window is displayed, which includes one or more options, among them a screen reading option; if the screen reading option is triggered, a notification message is displayed in the target page to notify that the screen reading mode has been started successfully. This implementation is described in detail with reference to the flowchart shown in FIG. 4a: a menu control 401 is displayed in the target page, and when the menu control 401 is triggered, an option window 402 containing a screen reading option 4021 is displayed; if the screen reading option 4021 is triggered, indicating that the user wants to start the barrier-free screen reading function of the target application, a notification message 403 is displayed in the target page, indicating that the screen reading mode of the target application has been started.
Note that the above is only an exemplary implementation of starting the screen reading mode of the target application; it can be understood that this implementation may vary in actual application scenarios. For example, after the screen reading option is triggered, a confirmation window may be output so that the user confirms again whether to start the screen reading mode of the target application. The confirmation window includes a confirmation option and a cancel option; when the cancel option is triggered, starting the screen reading mode of the target application is canceled. An exemplary confirmation window is shown in FIG. 4b: the confirmation window 404 includes a confirmation option 4041 and a cancel option 4042. In addition, although the description above takes the option window and the confirmation window displayed as overlays on the target page as an example, they may also be displayed as separate pages.
(2) Starting the screen reading mode through a voice password. In a specific implementation, after the smart terminal enables the voice input function, it can collect audio from the physical environment in which it is located, so the user can speak a sentence such as "open barrier-free screen reading" while voice input is enabled; after receiving the sentence, the smart terminal can automatically start the screen reading mode. The smart terminal collects audio from the physical environment through its built-in microphone.
(3) Starting the screen reading mode through a shortcut key. In a specific implementation, the embodiment of the present application supports quickly starting the screen reading mode through a shortcut key, for example ctrl + XX, so the user can start the screen reading mode simply by pressing the shortcut key. Optionally, the user may input the shortcut key on a virtual keyboard displayed on the display screen of the smart terminal; optionally, the smart terminal may be connected to an external physical keyboard on which the user inputs the shortcut key. In summary, the embodiments of the present application support multiple ways of starting the screen reading mode, enriching those ways, helping visually impaired people quickly start the screen reading mode of the target application or the smart terminal, and improving the speed and simplicity of starting it. An illustrative sketch of the shortcut-key entry is given below.
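As referenced above, a sketch of the shortcut-key entry; the text leaves the combination unspecified as ctrl + XX, so the key chosen here is purely an assumption.

```typescript
document.addEventListener('keydown', (ev: KeyboardEvent) => {
  // "ctrl + R" stands in for the unspecified "ctrl + XX" combination
  if (ev.ctrlKey && ev.key.toLowerCase() === 'r') {
    ev.preventDefault();
    startScreenReadingMode(); // hypothetical: flips the barrier-free screen reading switch
  }
});

declare function startScreenReadingMode(): void;
```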
In addition, when the target page contains an image and other content (such as text and icons), the embodiment of the present application supports playing the audio matched with each content item. In one implementation, in the screen reading mode, the audio matched with each content item may be read aloud in turn according to the arrangement order of the items in the target page. For example, continuing with FIG. 3, a return option 302, an application icon 303 of the target application, an application name 304 of the target application, a collection option 305, … are displayed from left to right in the first row of the target page 301; the audio matched with each item can be presented in turn to prompt the visually impaired user with the content contained in the target page 301. In another implementation, in the screen reading mode, the audio matched with the selected content can be played according to the user's selection operation on the target page (such as a trigger, long-press, drag, or double-click operation); that is, when any content in the target page is selected by the user, its semantic description information is acquired and the matching audio is played. In other implementations, in the screen reading mode, the audio matched with each content item can be read in turn in arrangement order, and when any content in the target page is triggered, the current audio stops and the audio matched with the triggered content is played. For example, in the screen reading mode, the audio of the first, second, and third content items is read in sequence; if, while the audio of the second item is being read, a trigger operation on the first item is detected, reading of the second item's audio stops and the audio matched with the first item is played. This satisfies users' on-demand reading needs and improves the barrier-free screen reading experience. A sketch of this sequential reading with interruption is given below.
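A sketch of this sequential reading order with user-triggered interruption, using the Web Speech API; the queue class and its method names are assumptions.

```typescript
class ReadingQueue {
  private items: string[] = [];

  // Read the audio matched with each content item in its arrangement order on the page
  start(contents: string[]): void {
    this.items = [...contents];
    this.next();
  }

  // On selection of any content item, stop the current audio and play the selected one
  interruptWith(selected: string): void {
    speechSynthesis.cancel(); // stop the audio currently being read
    this.speak(selected, () => this.next());
  }

  private next(): void {
    const text = this.items.shift();
    if (text !== undefined) this.speak(text, () => this.next());
  }

  private speak(text: string, onEnd: () => void): void {
    const u = new SpeechSynthesisUtterance(text);
    u.onend = onEnd;
    speechSynthesis.speak(u);
  }
}

// Usage: queue three content items in arrangement order, then interrupt with the first
const queue = new ReadingQueue();
queue.start(['first content', 'second content', 'third content']);
// later, when the user triggers the first content item:
// queue.interruptWith('first content');
```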
The audio matched with the target content can be used for prompting the basic content of the target content, the operation information related to the target content, the content format of the target content and the structure and the position of the target content; the target content is any one of a plurality of contents included in the target page. The following briefly introduces the content (such as basic content, operation information, etc.) indicated by the audio in conjunction with fig. 5a, 5b, and 5 c:
(1) As shown in FIG. 5a, the audio is used to prompt the basic content and operation information of each content item in the target page.
And according to different types of the content contained in the target page, the basic content corresponding to the content is different. For example, if the target page contains an image, the basic content of the image may include, but is not limited to, the following: target color presented by the image, text content contained by the image, source of the image, author of the image, object contained by the image, behavior executed by the object contained by the image, and the like; for another example, if the target page contains text, the basic content of the text may include, but is not limited to: the characters that make up the text, the font in which the characters are used, the color of the characters, etc.
The operation information may include at least one of: information of the operator, the operated object, the type of the operated object, information of the operation item, the feedback presented when the operation item is selected, and the change caused to the target page by the operation. That is to say, in the screen reading mode, the embodiment of the present application supports converting the operation information in the target page into operation audio for playback. For example, if the user selects a certain operation item (e.g., an option) in the target page, the feedback presented when the operation item is selected can be announced, e.g., that the operation item is highlighted; the change caused to the target page by the selection can also be announced, e.g., that content related to the operation item is displayed in the target page; and so on. For another example, assuming the target application is a document application with multi-user collaborative editing, when any collaborator participating in collaborative editing edits the target page, the operation information related to that collaborator can be played immediately on the visually impaired user's side, such as the information of the collaborator (i.e., the operator), the operated object, the type of the operated object, the information of the operation item, the feedback presented when the operation item is selected, and the change caused to the target page. This helps visually impaired people learn the collaborators' operations in real time and participate in normal work and life communication.
In summary, the embodiment of the present application can play the basic content of each content in the target page, and play all user operations (such as selection, deletion, modification, and the like) and system feedback (such as changes based on the user operating the target page) in the target page, and by playing the changes in the target page immediately, the visually impaired can sense the interaction between themselves and the target application program immediately, thereby improving the security of the visually impaired using the target application program.
(2) As shown in FIG. 5b, the audio is used to prompt the content format of each content item in the target page.
It can be understood that the format of content in the target page may affect the user's perception. As shown in FIG. 3, the content in the first line is text; if only the basic content of the text were played during voice broadcast, its format information would be lost, affecting visually impaired users' recognition and understanding of it. Therefore, the embodiment of the present application supports announcing the format of the text along with the text itself. For example, a shape marking the text can be read aloud as "square", "circle", etc.; as another example, a link attached to the text can be read aloud as "link". Optionally, announcing the content format may include: announcing the format of the characters in the target page during broadcast, and describing the non-text content in the target page in words as much as possible, to make it easier for visually impaired users to understand.
(3) As shown in FIG. 5c, the audio is used to prompt the structure and location of each content item in the target page.
The structure and location of content in a target page are inherently "graphical" concepts perceived visually, such as where a function menu is located in the target page and whether the menu is a list or a grid, … If the output audio loses the structure and location of the content, visually impaired users cannot accurately understand the content of the target page. For this reason, the embodiment of the present application supports reading aloud the current state of entering or leaving a certain area (or content, functional area, etc.) of the target page, and also supports reading aloud the position of the current cursor (such as the mouse cursor) in the target page, to help visually impaired users locate themselves in the target page in real time, understand the target page, and avoid losing their bearings. For example, after a visually impaired user enters the toolbar through a shortcut key, "enter toolbar" needs to be read aloud; when entering a specific menu from the toolbar, "enter menu" needs to be read aloud; after entering the menu, the item position of the current cursor within the menu also needs to be read aloud, and so on.
In the embodiment of the present application, if the displayed target page contains an image, the semantic description information of the image can be acquired, and the prompt audio matched with it can then be played. The semantic description information can describe the content expressed by the image semantically, and converting it into matching prompt audio for announcement means the image (i.e., non-text content) in the target page can be effectively read aloud, improving the completeness of page information reading. In addition, playing by voice broadcast the prompt audio matched with the semantic description information of the image can accurately and completely express the content of the image, helping visually impaired users understand its semantics, providing richer page information during screen reading of the target page, and improving the intelligence of page screen-reading processing.
FIG. 6 is a flowchart illustrating a page processing method according to an exemplary embodiment of the present application; the page processing method may be performed by a computer device (such as the computer device 101 mentioned above), and may include, but is not limited to, steps S601-S606:
S601: Display the target page.
The way the semantic description information of the image is obtained differs according to the page type of the target page. In one implementation, the target page is a first page, i.e., a page that does not allow the user to edit its content; the first page may include any one of: a web page, a service page of an application, a service page of an applet, a multimedia playing page, and so on. When the target page is the first page, the image it contains is a native image of the first page (i.e., not an image added by the user) and does not support editing, and obtaining the semantic description information of the image may be implemented as: obtaining the semantic description information of the image during loading of the first page. In other words, when the target page is the first page, the operation of acquiring the semantic description information of the image is triggered while the first page is loaded (or rendered) in the background. For example, if the target page is any web page displayed in a browser, and the web page only supports browsing its content, then during the background loading of the web page, acquisition of the semantic description information of the images in it can be triggered.
In other implementations, the target page is a second page, the second page is a page supporting the user to edit content in the page, and the second page may include any one of the following: document editing pages, online document editing pages, social session pages, questionnaire-like pages, and the like; when the target page is a second page and the image in the target page is added to the second page through an editing operation, an implementation manner of acquiring semantic description information of the image may include: when the image is added in the second page, semantic description information of the image is acquired. Of course, if the second page also includes the native image, the semantic description information of the native image is obtained in the process of loading the second page.
As described in the foregoing embodiment shown in fig. 2, the acquired semantic description information of the image may include at least one of the following: the description information of the target color presented by the image, the description information of the text content contained in the image, the description information of the source of the image, the description information of the author of the image, the description information of the object contained in the image, and the description information of the behavior executed by the object contained in the image. Because the semantic description information involves identifying the color of the image, the objects it contains, and related information, it is easy to see that the embodiment of the present application relates to Computer Vision (CV) technology in the field of artificial intelligence. Computer vision is a science that studies how to make machines "see": using cameras and computers in place of human eyes to recognize, track, and measure targets, and further processing the resulting graphics into images better suited for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, among others. Specifically, the manner of obtaining the semantic description information of the image based on computer vision technology can be seen in the following description of steps S602-S605; the execution order of steps S602-S605 is not limited in the embodiment of the present application.
S602: and if the target page contains the image, acquiring the description information of the target color presented by the image.
The target color may be the color with the largest display area in the image, the color of an object contained in the image, the background color of the image, or the like; the target color may be defined according to business requirements, which is not limited in the embodiment of the present application. The target color of the image can be determined based on the HSL color model, a representation that maps the points of the RGB (red, green, blue) color model into a cylindrical coordinate system. H represents hue and ranges over [0, 360]: when H is 0 or 360 the hue is red, when H is 120 the hue is green, and when H is 240 the hue is blue. S represents saturation and ranges over [0%, 100%]: when S is 0% the color degenerates to gray, and when S is 100% the color is fully saturated. L represents lightness and ranges over [0%, 100%]: when L is 0% the color appears black, when L is 50% it appears normal, and when L is 100% it appears white. A schematic representation of colors in the cylindrical coordinate system can be seen in fig. 7.
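To make the HSL representation above concrete, the following TypeScript sketch converts one RGB pixel into (H, S, L) values using the standard conversion formulas; the function name and the chosen value ranges are illustrative assumptions rather than part of the embodiment.

```typescript
// Convert an RGB pixel (each channel in [0, 255]) to HSL.
// H is in [0, 360); S and L are returned as fractions in [0, 1].
function rgbToHsl(r: number, g: number, b: number): { h: number; s: number; l: number } {
  const rn = r / 255, gn = g / 255, bn = b / 255;
  const max = Math.max(rn, gn, bn);
  const min = Math.min(rn, gn, bn);
  const l = (max + min) / 2;                   // lightness: 0 = black, 1 = white
  const delta = max - min;
  if (delta === 0) return { h: 0, s: 0, l };   // achromatic: S = 0 means gray
  const s = delta / (1 - Math.abs(2 * l - 1)); // saturation: 1 = fully saturated
  let h: number;
  if (max === rn)      h = 60 * (((gn - bn) / delta) % 6);
  else if (max === gn) h = 60 * ((bn - rn) / delta + 2);
  else                 h = 60 * ((rn - gn) / delta + 4);
  if (h < 0) h += 360;                         // red at 0/360, green at 120, blue at 240
  return { h, s, l };
}
```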
In the embodiment of the application, the acquisition of the description information of the target color is described taking as an example the case where the target color is the color with the largest display area in the image. First, S colors are identified from the image, where S is an integer greater than 1; the S colors may correspond to different objects in the image (such as a person, a hat, a sofa, etc.), and the same object may of course present several colors (for a person, for example, the coat may be red and the trousers black). Next, the number of pixels and the saturation of each of the S colors are acquired, and the two are multiplied to obtain the color score of each color. Finally, the color with the largest color score among the S colors is determined as the target color of the image, and description information describing the target color is generated. In other words, the saturation and the pixel count of each color in the image are used to evaluate that color's score, and the color with the highest score becomes the target color; the formula for the color score of each color is:
score = pixel_nums × saturation
where score is the color score, pixel_nums is the number of pixels of the color, and saturation is its saturation.
For example, suppose the image contains 3 colors: red, yellow, and blue, where red has 40 pixels and 40% saturation, yellow has 70 pixels and 64% saturation, and blue has 80 pixels and 49% saturation. The three color scores are then: red 40 × 40% = 16, yellow 70 × 64% = 44.8, and blue 80 × 49% = 39.2; since yellow has the highest color score, yellow is determined to be the target color of the image.
It can be understood that the embodiment of the application also supports sorting the S color scores of the S colors from high to low and determining the colors ranked before a position threshold as target colors; when the audio matched with these colors is played, each color can be announced together with the object it belongs to. Referring to fig. 8, the objects contained in the image include a boy 801, a slide 802, and a hat 803, where the hat 803 is black and the slide 802 is white; the audio played for the image may be "a boy wearing a black hat is sliding down a white slide". It should be noted that the position threshold may be set by service personnel according to business requirements, and its specific value is not limited in this embodiment. A sketch of this color-scoring logic is given below.
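The following TypeScript sketch implements the scoring and ranking described above; the ColorStat shape and the default position threshold are assumptions made for illustration.

```typescript
// Illustrative per-color statistics: pixel count and saturation (in [0, 1]).
interface ColorStat { name: string; pixelNums: number; saturation: number; }

// score = pixel_nums × saturation, as in the formula above.
function colorScore(c: ColorStat): number {
  return c.pixelNums * c.saturation;
}

// Rank colors by score from high to low and keep those before the
// position threshold; a threshold of 1 picks the single target color.
function targetColors(colors: ColorStat[], positionThreshold = 1): string[] {
  return [...colors]
    .sort((a, b) => colorScore(b) - colorScore(a))
    .slice(0, positionThreshold)
    .map(c => c.name);
}

// The worked example from the text: yellow wins with 70 × 0.64 = 44.8.
const stats: ColorStat[] = [
  { name: "red",    pixelNums: 40, saturation: 0.40 },
  { name: "yellow", pixelNums: 70, saturation: 0.64 },
  { name: "blue",   pixelNums: 80, saturation: 0.49 },
];
console.log(targetColors(stats)); // ["yellow"]
```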
S603: and acquiring description information of the text content contained in the image.
FIG. 9 is a flow chart illustrating a process for obtaining the text content contained in an image according to an exemplary embodiment of the present application; as shown in fig. 9, the process may include the following steps:
First, the image is preprocessed.
Preprocessing the image reduces useless information in it and facilitates subsequent feature extraction and learning. Preprocessing may include, but is not limited to: grayscale conversion, noise reduction, binarization, character segmentation, and normalization. Converting the image (such as a color image) to grayscale reduces the raw data volume, so later processing takes less time and computation. Noise reduction eliminates or suppresses noise in the image and improves image quality, and its quality has a large influence on subsequent feature extraction. Binarization sets the gray value of every pixel to 0 or 255, giving the image a clear black-and-white appearance. Character segmentation splits the characters in the image into individual characters so that they can later be recognized one by one; if the characters are skewed, a skew correction is usually needed first to ease subsequent processing. Normalization scales the individual characters to the same size and form so that one algorithm can process them all afterwards. In other embodiments, the characters in the image may instead be segmented into words, so that recognition later proceeds word by word. A sketch of the grayscale and binarization steps follows.
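As a minimal sketch, the following TypeScript function performs the grayscale and binarization steps on a browser ImageData buffer; the luma weights and the fixed threshold are illustrative choices, not values prescribed by the embodiment.

```typescript
// Grayscale then binarize an ImageData buffer in place.
function grayscaleAndBinarize(img: ImageData, threshold = 128): ImageData {
  const d = img.data; // RGBA bytes, 4 per pixel
  for (let i = 0; i < d.length; i += 4) {
    // ITU-R BT.601 luma approximation for the gray value.
    const gray = 0.299 * d[i] + 0.587 * d[i + 1] + 0.114 * d[i + 2];
    const bw = gray >= threshold ? 255 : 0; // binarization: each pixel becomes 0 or 255
    d[i] = d[i + 1] = d[i + 2] = bw;        // alpha channel left untouched
  }
  return img;
}
```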
Second, image features are extracted from the preprocessed image.
The image features are the key information for identifying the text content in the image; through them, each piece of text in the image can be distinguished from the others. The embodiment of the application supports various feature extraction algorithms for the image in the target page, which may include but are not limited to: Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), Difference of Gaussians (DoG), and the like; which specific algorithm is used to extract the image features is not limited in the embodiment of the present application. Of course, if the dimensionality of the extracted features is too high, subsequent recognition efficiency and quality suffer, so the embodiment of the application also supports dimensionality reduction on high-dimensional image features to ease subsequent recognition.
Third, a classifier is used to classify the image features so as to identify the text content contained in the image.
A classifier is a general term for methods that classify data; here it receives the image features of the image and classifies them to decide which text content they correspond to. Common classifiers may include, but are not limited to: decision trees, logistic regression, naive Bayes, neural network algorithms, and the like; the specific classifier used is not limited in the embodiment of the application.
Fourth, description information describing the text content contained in the image is generated. The whole pipeline is sketched below.
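The following TypeScript sketch strings the four steps together; every type and stage signature here is an assumed placeholder for illustration, not an API defined by the embodiment.

```typescript
// Schematic OCR pipeline mirroring the four steps above.
type FeatureVector = number[];

interface OcrStages {
  preprocess(img: ImageData): ImageData;          // grayscale, denoise, binarize, normalize
  segmentChars(img: ImageData): ImageData[];      // one sub-image per character
  extractFeatures(ch: ImageData): FeatureVector;  // e.g. HOG- or SIFT-style features
  classify(features: FeatureVector): string;      // classifier mapping features to a character
}

function recognizeText(img: ImageData, stages: OcrStages): string {
  const chars = stages.segmentChars(stages.preprocess(img));
  return chars.map(ch => stages.classify(stages.extractFeatures(ch))).join("");
}

// Step four: generate templated description information for the text content.
function describeTextContent(img: ImageData, stages: OcrStages): string {
  return `The image contains the text: "${recognizeText(img, stages)}"`;
}
```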
S604: the description information of the source of the image and the description information of the author of the image are obtained.
In the embodiment of the present application, the acquisition of the description information of the image's source and author is described taking as an example an image that comes either from local space or from a network file. Referring to fig. 10, fig. 10 is a flow chart illustrating a method for obtaining the description information of the source of an image and of the author of an image according to an exemplary embodiment of the present application. As shown in fig. 10: first, the source of the image is obtained, the source indicating whether the image comes from local space or from a network file. Second, if the source indicates that the image comes from local space, the author of the image is read from local space, for example from the file attributes maintained by the operating system of the intelligent device, which include the author of the image; if the source indicates that the image comes from a network file, a link of the image is acquired and the author is read according to the link, where the link may be a hyperlink that, when clicked, jumps directly to the web page containing the image. Finally, the description information of the source of the image and of the author of the image is generated. A sketch of this branch follows.
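The TypeScript sketch below illustrates the two branches; the ImageMeta shape and the two reader callbacks are assumptions made for illustration.

```typescript
// Schematic acquisition of the source and author description information.
interface ImageMeta {
  source: "local" | "network";
  filePath?: string; // set when the image comes from local space
  link?: string;     // hyperlink when the image comes from a network file
}

function describeSourceAndAuthor(
  meta: ImageMeta,
  readLocalAuthor: (path: string) => string,  // e.g. from OS file attributes
  readLinkedAuthor: (link: string) => string, // e.g. from the page the link points to
): string {
  const author = meta.source === "local"
    ? readLocalAuthor(meta.filePath ?? "")
    : readLinkedAuthor(meta.link ?? "");
  const origin = meta.source === "local" ? "local space" : "a network file";
  return `The image comes from ${origin}; its author is ${author}.`;
}
```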
S605: description information of an object included in an image and description information of a behavior performed on the object included in the image are acquired.
In a specific implementation, a visual vocabulary model can be called to perform object recognition on the image, identifying the objects contained in the image and the behavior sentence patterns associated with those objects; description information of the objects is then generated, and description information of the behaviors executed by the objects is generated from the objects and their associated behavior sentence patterns. The visual vocabulary model may be an attention-based transformer model (a machine-translation-style model); given an image and a set of candidate objects (i.e., predefined recognizable objects), the model generates a sequence of characters autoregressively (each character conditioned on those already generated) and then produces a sentence (i.e., a character string) describing the image according to the behavior sentence pattern. The training and application of the visual vocabulary model are described in more detail below in conjunction with FIGS. 11a and 11b, after a short sketch of the sentence-pattern filling.
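As a minimal sketch of how recognized objects might be slotted into a behavior sentence pattern, the following TypeScript fills a template string; the `{slot}` syntax and the example pattern are assumed conventions, not the templates of the embodiment itself.

```typescript
// Fill a behavior sentence pattern with recognized objects.
interface Recognition {
  objects: string[];       // e.g. ["boy", "hat", "skateboard"]
  behaviorPattern: string; // e.g. "a {subject} wearing a {attr} is riding a {object}"
}

function fillPattern(r: Recognition, slots: Record<string, string>): string {
  // Replace each {slot} placeholder with the recognized object assigned to it.
  return r.behaviorPattern.replace(/\{(\w+)\}/g, (_, name) => slots[name] ?? name);
}

const rec: Recognition = {
  objects: ["boy", "hat", "skateboard"],
  behaviorPattern: "a {subject} wearing a {attr} is riding a {object}",
};
console.log(fillPattern(rec, { subject: "boy", attr: "hat", object: "skateboard" }));
// -> "a boy wearing a hat is riding a skateboard"
```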
The embodiment of the application trains the visual vocabulary model with a visual-vocabulary pre-training method; the training process comprises a pre-training phase and a fine-tuning phase. Specifically, the visual-vocabulary pre-training method supports multi-modal pre-training on images without text labels (i.e., without text annotating the semantics a sample image is meant to express), where "multi-modal" refers to different expression forms of the same object, such as representing it with images, characters, animation, and so on. Training the visual vocabulary model therefore does not depend on paired image-text labels, and large computer vision data sets, such as the class labels (tags) used for image recognition, can be used. Through this pre-training, the visual vocabulary model learns, from large-scale data, the relation between the visual appearance of all kinds of objects and their semantic names, i.e., it builds visual vocabularies. A visual vocabulary may be defined as a joint embedding space of images and words, in which semantically similar text (or characters) is mapped to feature vectors that lie closer together. The pre-training phase and the fine-tuning phase are described in turn below.
Pre-training stage. In the pre-training stage, semantically similar class labels (or texts) and the corresponding image features are mapped to feature vectors that lie close together. As shown in fig. 11a, assume that the semantics of image 1 describe a yellow puppy and the semantics of image 2 describe a black dog; since both represent the same category, dog, the image features and class label of image 1 (yellow dog) and those of image 2 (black dog) may be mapped onto the feature vector at the top left corner of the visual vocabulary. Likewise, if the semantics of image 3 describe an instrument and those of image 4 an accordion, the image features and class label of image 3 (instrument) and of image 4 (accordion) may be mapped onto the feature vector at the lower left corner, and so on; the pre-training stage thereby builds a visual vocabulary in which semantically similar class labels and their image features map to nearby feature vectors. More specifically, as shown in fig. 11b, a multi-layer visual vocabulary model may be used in the pre-training stage to predict image classifications: given a number of sample images and the class labels of each, some class labels are randomly erased and the model is asked to predict them, which trains the model's ability to predict the objects contained in an image. As shown in fig. 11b, the class label "skateboarding" can be erased and the model made to predict it. Since the order of the class labels is interchangeable, the embodiment of the present application may use the Hungarian algorithm (Hungarian matching), a combinatorial optimization algorithm, to find a one-to-one correspondence between the predicted categories and the class labels, and then compute a cross-entropy loss to evaluate whether the optimization of the visual vocabulary model has succeeded.
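To make the matching-plus-loss step concrete, the following toy TypeScript sketch finds the one-to-one assignment between predicted label distributions and ground-truth labels that minimizes total cross-entropy. A brute-force permutation search stands in for the Hungarian algorithm here and is only viable for a handful of labels; the LabelProbs shape is an assumption for illustration.

```typescript
// Predicted probability per candidate label for one prediction slot.
type LabelProbs = Map<string, number>;

function crossEntropy(pred: LabelProbs, truth: string): number {
  const p = pred.get(truth) ?? 1e-9; // floor to avoid log(0)
  return -Math.log(p);
}

// Minimum total cross-entropy over all one-to-one assignments of
// ground-truth labels to prediction slots (labels are order-free).
// Assumes preds and truths have equal length.
function bestMatchingLoss(preds: LabelProbs[], truths: string[]): number {
  let best = Infinity;
  const permute = (rest: string[], chosen: string[]): void => {
    if (chosen.length === preds.length) {
      const loss = chosen.reduce((sum, t, i) => sum + crossEntropy(preds[i], t), 0);
      best = Math.min(best, loss);
      return;
    }
    rest.forEach((t, i) =>
      permute([...rest.slice(0, i), ...rest.slice(i + 1)], [...chosen, t]));
  };
  permute(truths, []);
  return best;
}
```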
Fine-tuning stage. After the visual vocabulary has been produced in the pre-training stage, the embodiment of the application also supports fine-tuning the visual vocabulary model with sample images paired with text labels, so that the model learns to generate a general behavior sentence pattern template from a sample image and the recognized objects; filling the recognized objects into the behavior sentence pattern template then yields the description information for the behavior of the objects in the image. It should be noted that the labels paired with the sample images during fine-tuning may come from a sample data set (i.e., a set of sample images paired with text labels) or may be generated automatically by other trained image classification or object recognition models; this is not limited in the embodiment of the present application. As shown in fig. 11a, in the fine-tuning stage a sample image carries the text label "a boy wearing a hat is skateboarding", which contains the objects "hat", "skateboard", and "boy". If, after recognizing the sample image, the visual vocabulary model predicts the text "a boy wearing a hat", it is determined that the model failed to predict the object "skateboard" and training continues; otherwise, the optimized visual vocabulary model is obtained. In short, fine-tuning trains the model's ability to generate sentences that describe the semantics of an image.
In summary, by pre-training and then fine-tuning the visual vocabulary model, the trained model can recognize more accurately the objects contained in an image and the behaviors they execute, improving both the recognition performance and the efficiency of the model.
S606: and playing prompt audio matched with the semantic description information of the image.
In a specific implementation, after the semantic description information of the image is obtained, a hidden document object node can be created for the target page. Document object nodes are manipulated through the Document Object Model (DOM), a platform- and language-neutral application programming interface (API) through which programs and scripts can dynamically access and update the content, structure, and style of a document. An auxiliary attribute (such as the aria-live attribute) is set on the hidden document object node; the semantic description information of the image is processed into a template text in templated form; the template text is written into the hidden document object node; when the write operation is monitored, prompt audio is matched for the written template text; and the prompt audio is played. A sketch of this technique follows.
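The following TypeScript sketch shows the hidden live-region technique in a browser, assuming a screen reader that announces aria-live updates; the element styling and the template wording are illustrative choices.

```typescript
// Create a hidden document object node with an aria-live auxiliary attribute.
function createHiddenLiveRegion(): HTMLElement {
  const node = document.createElement("div");
  node.setAttribute("aria-live", "polite"); // announce content changes
  node.setAttribute("role", "status");
  // Visually hidden but still exposed to assistive technology.
  Object.assign(node.style, {
    position: "absolute", width: "1px", height: "1px",
    overflow: "hidden", clip: "rect(0 0 0 0)",
  });
  document.body.appendChild(node);
  return node;
}

// Writing the templated description into the node triggers the announcement.
function announceImageDescription(region: HTMLElement, description: string): void {
  region.textContent = `Image: ${description}`;
}

const region = createHiddenLiveRegion();
announceImageDescription(region, "a boy wearing a black hat is sliding down a white slide");
```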
It should be noted that the target page, and hence the image above, may be rendered in a canvas manner. The canvas tag is an HTML tag on which graphics can be drawn dynamically through scripts (such as JavaScript). When the target page is rendered through a canvas tag, the playing of the prompt audio in step S606 may be implemented as follows: create a hidden document object node for the target page and set an auxiliary attribute on it; write the content of the canvas node into the hidden document object node; and, when the write operation is monitored, match content audio for the written content and play it. In other words, when the target page is rendered through a canvas tag, the embodiment of the present application supports identifying the content drawn by the canvas tag and playing the content audio corresponding to that content.
An exemplary implementation of playing content audio matched with the content contained in the canvas node is described below with reference to fig. 12, taking an operation event on the canvas node (e.g., a select-all event) as an example. As shown in fig. 12: first, the user's operation event on the canvas node is monitored; for example, the user selects all the content on the canvas node through a keyboard (e.g., a physical keyboard) using the select-all shortcut (Ctrl+A). Second, the feedback result of the canvas node for the operation event is obtained; for example, a monitoring component (an accessibility capability component) listens for operation events on the canvas node, classifies each event by user-operation category, and derives the corresponding feedback result. Third, the feedback result is sent to the hidden document object node, which takes the feedback result as its node content. Fourth, when a change in the node content of the document object node is monitored, operation audio is matched for the written operation event and feedback result and is played. A sketch of this wiring is given below.
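Continuing the live-region sketch above, the following TypeScript wires a select-all event on a canvas-rendered page to the hidden node; the canvas id and the feedback wording are assumptions for illustration.

```typescript
// Reuse createHiddenLiveRegion from the previous sketch.
const canvasEl = document.getElementById("page-canvas") as HTMLCanvasElement;
canvasEl.tabIndex = 0; // the canvas must be focusable to receive key events
const liveRegion = createHiddenLiveRegion();

canvasEl.addEventListener("keydown", (e: KeyboardEvent) => {
  if ((e.ctrlKey || e.metaKey) && e.key.toLowerCase() === "a") {
    e.preventDefault();
    // Feedback result of the select-all operation event on the canvas node;
    // writing it into the hidden node triggers the spoken announcement.
    liveRegion.textContent = "All content on the canvas selected";
  }
});
```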
In the embodiment of the application, for a displayed target page that contains an image, the semantic description information of the image can be acquired and prompt audio matched with it can be played; the semantic description information describes, at the semantic level, the content expressed by the image, and is converted into matched prompt audio for broadcasting. In addition, if the target page is rendered in a canvas manner, the embodiment of the application can also identify the content contained in the canvas node and play the content audio matched with it. This achieves effective screen reading of the images or canvas nodes (non-text content) in the target page and improves the completeness of page-information reading; moreover, because the played prompt audio expresses the content of the image or canvas node accurately and completely, it assists in understanding their semantics, provides richer page information during screen reading of the target page, and improves the intelligence of page screen-reading processing.
The method of the embodiments of the present application having been described in detail above, the apparatus of the embodiments of the present application is provided below to facilitate better implementation of the above-described aspects.
FIG. 13 is a schematic structural diagram of a page processing apparatus according to an exemplary embodiment of the present application; the page processing apparatus may be a computer program (comprising program code) running on a computer device, for example an application in the computer device (such as a Tencent Docs application); the page processing apparatus may be used to perform some or all of the steps of the method embodiments shown in fig. 2 and fig. 6. Referring to fig. 13, the page processing apparatus includes the following units:
a display unit 1301 for displaying a target page;
a processing unit 1302, configured to obtain semantic description information of an image if the target page includes the image;
the processing unit 1302 is further configured to play a prompt audio matched with the semantic description information of the image.
In one implementation, semantic description information of an image is used to semantically describe the image;
the semantic description information includes at least one of: description information of a target color presented by the image, description information of text content contained in the image, description information of a source of the image, description information of an author of the image, description information of an object contained in the image, and description information of a behavior executed by the object contained in the image;
the prompt audio is used for prompting at least one of the following: the target color presented by the image, the text content contained in the image, the source of the image, the author of the image, the object contained in the image, and the behavior executed by the object contained in the image.
In one implementation, the processing unit 1302 is further configured to start a screen reading mode;
the processing unit 1302, configured to, when playing a prompt audio matched with the semantic description information of the image, specifically: and in the screen reading mode, playing prompt audio matched with the semantic description information of the image.
In one implementation, the target page further includes other content, and the other content includes at least one of the following: text, rich text, icons; the processing unit 1302 is further configured to:
and in the screen reading mode, sequentially reading the audio matched with each content according to the arrangement sequence of each content in the target page.
In one implementation, the target page further includes operation information, and the operation information includes at least one of the following items: the information of the operator, the operated object, the type of the operated object, the information of the operation item, the feedback presented when the operation item is selected, and the change caused by the operation information to the target page; the processing unit 1302 is further configured to:
and in the screen reading mode, converting the operation information in the target page into operation audio for playing and outputting.
In one implementation, a target page refers to any service page in a target application; the target application program supports a screen reading mode and provides an entrance of the screen reading mode; the processing unit 1302, configured to, when the screen reading mode is started, specifically:
when the entrance of the screen reading mode is triggered, starting the screen reading mode;
wherein the entry of the screen reading mode comprises any one of: keys, icons, menu items, voice passwords.
In one implementation, if the target page is a first page, the image is a native image in the first page and does not support editing; the processing unit 1302 is configured to, when obtaining semantic description information of an image, specifically:
in the process of loading the first page, semantic description information of the image is acquired;
wherein the first page comprises any one of: the system comprises a webpage, a service page of an application program, a page of an applet program and a multimedia playing page.
In one implementation, if the target page is a second page, the image is added to the second page by an editing operation; the processing unit 1302 is configured to, when obtaining semantic description information of an image, specifically:
when an image is added to the second page, semantic description information of the image is acquired;
wherein the second page includes any one of: document editing pages, online document editing pages, social session pages.
In one implementation, the semantic description information includes the description information of the target color presented by the image; the processing unit 1302 is configured to, when obtaining semantic description information of an image, specifically:
identifying S colors from the image, wherein S is an integer larger than 1;
acquiring the pixel number and the saturation of each color in the S colors;
multiplying the number of pixels of each color by its saturation to obtain the color score of each color;
determining the color corresponding to the maximum color score in the S colors as the target color of the image;
descriptive information for describing the target color is generated.
In one implementation, the semantic description information includes description information of text content contained in the image; the processing unit 1302 is configured to, when obtaining semantic description information of an image, specifically:
preprocessing the image;
carrying out image feature extraction on the preprocessed image to obtain image features;
classifying the image characteristics by adopting a classifier so as to identify the text content contained in the image;
and generating description information for describing the text content contained in the image.
In one implementation, the semantic description information includes description information of a source of the image and description information of an author of the image; the processing unit 1302 is configured to, when obtaining semantic description information of an image, specifically:
acquiring a source of an image;
if the source of the image indicates that the image is from the local space, reading an author of the image from the local space;
if the source of the image indicates that the image is from the network file, acquiring a link of the image, and reading an author of the image according to the link;
descriptive information of the source of the image and of the author of the image is generated.
In one implementation, the semantic description information includes the description information of the object contained in the image and the description information of the behavior executed by the object contained in the image; the processing unit 1302 is configured to, when obtaining semantic description information of an image, specifically:
calling a visual vocabulary model to perform object recognition processing on the image, and identifying the object contained in the image and the behavior sentence pattern associated with the object;
generating the description information of the object contained in the image, and generating, according to the object and the behavior sentence pattern associated with it, the description information of the behavior executed by the object contained in the image.
In one implementation, the processing unit 1302, when playing the cue audio matched with the semantic description information of the image, is specifically configured to:
creating a hidden document object node for the target page, and setting an auxiliary attribute for the hidden document object node;
processing semantic description information of the image into template texts in a templated form;
writing the template text into the hidden document object node;
when the writing operation is monitored, matching prompt audio for the written template text; and
playing the prompt audio.
In one implementation, the target page is rendered in a Canvas manner; the processing unit 1302 is further configured to:
creating a hidden document object node for the target page, and setting an auxiliary attribute for the hidden document object node;
writing the content in the Canvas node into a hidden document object node;
when the write operation is monitored, matching content audio for the written content; and playing the content audio.
In one implementation, the processing unit 1302 is further configured to:
monitoring an operation event on a Canvas node and a feedback result of the operation event;
writing the operation event and the feedback result into a hidden document object node;
when the write operation is monitored, matching operation audio for the written operation event and the feedback result; and playing the operating audio.
According to an embodiment of the present application, the units in the page processing apparatus shown in fig. 13 may be combined, individually or entirely, into one or several other units, or one (or more) of them may be split into several functionally smaller units; either way the same operations can be implemented without affecting the technical effects of the embodiments of the present application. The above units are divided on the basis of logical functions; in practice, the function of one unit may be realized by several units, or the functions of several units by one unit. In other embodiments of the present application, the page processing apparatus may also include other units, and in practical applications these functions may be realized with the assistance of, or through the cooperation of, multiple units. According to another embodiment of the present application, the page processing apparatus shown in fig. 13 may be constructed, and the page processing method of the embodiment of the present application implemented, by running a computer program (including program code) capable of executing the steps of the methods shown in fig. 2 and fig. 6 on a general-purpose computing device, such as a computer, that includes processing elements such as a central processing unit (CPU), a random access memory (RAM), and a read-only memory (ROM), together with storage elements. The computer program may be recorded on, for example, a computer-readable recording medium, and loaded into and run on the above computing device via that medium.
In this embodiment of the present application, for a displayed target page that contains an image, the processing unit 1302 may acquire the semantic description information of the image and then play the prompt audio matched with it. In this scheme, the semantic description information describes, at the semantic level, the content expressed by the image, and is converted into matched prompt audio for broadcasting, so that the image (non-text content) in the target page can be read aloud effectively and the completeness of page-information reading is improved; in addition, because the played prompt audio expresses the content of the image accurately and completely, it assists in understanding the semantics of the image, provides richer page information during screen reading of the target page, and improves the intelligence of page screen-reading processing.
Fig. 14 shows a schematic structural diagram of a computer device according to an exemplary embodiment of the present application. Referring to fig. 14, the computer device includes a processor 1401, a communication interface 1402, and a computer-readable storage medium 1403, which may be connected by a bus or in other ways. The communication interface 1402 is used for receiving and transmitting data. The computer-readable storage medium 1403 may reside in the memory of the computer device and is used to store a computer program comprising program instructions; the processor 1401 executes the program instructions stored in the computer-readable storage medium 1403. The processor 1401 (or CPU) is the computing and control core of the computer device, adapted to implement one or more instructions and, in particular, to load and execute one or more instructions so as to realize the corresponding method flow or function.
Embodiments of the present application also provide a computer-readable storage medium (memory), which is a memory device in the computer device used to store programs and data. It is understood that the computer-readable storage medium here can include both built-in storage media of the computer device and extended storage media that the computer device supports. The computer-readable storage medium provides a storage space that stores the processing system of the computer device. One or more instructions, which may be one or more computer programs (including program code), are also stored in this storage space, suitable for loading and execution by processor 1401. The computer-readable storage medium may be a high-speed RAM, or a non-volatile memory such as at least one disk memory; optionally, it may also be at least one computer-readable storage medium located remotely from the aforementioned processor.
In one embodiment, the computer device may be the smart device mentioned in the previous embodiment; the computer-readable storage medium has one or more instructions stored therein; one or more instructions stored in a computer-readable storage medium are loaded and executed by processor 1401 to implement the corresponding steps in the above-described embodiments of the page processing method; in particular implementations, one or more instructions in the computer-readable storage medium are loaded and executed by processor 1401 to perform the steps of:
displaying a target page;
if the target page contains the image, semantic description information of the image is acquired;
and playing prompt audio matched with the semantic description information of the image.
In one implementation, semantic description information of an image is used to semantically describe the image;
the semantic description information includes at least one of: description information of a target color presented by the image, description information of text content contained in the image, description information of a source of the image, description information of an author of the image, description information of an object contained in the image, and description information of a behavior executed by the object contained in the image;
the prompt audio is used for prompting at least one of the following: the target color presented by the image, the text content contained in the image, the source of the image, the author of the image, the object contained in the image, and the behavior executed by the object contained in the image.
In one implementation, one or more instructions in a computer-readable storage medium are loaded by processor 1401 and further perform the steps of: starting a screen reading mode;
one or more instructions in the computer-readable storage medium are loaded by processor 1401 and when executing playing hinting audio that matches semantic description information of an image, perform the following steps:
and in the screen reading mode, playing prompt audio matched with the semantic description information of the image.
In one implementation, the target page further includes other content, and the other content includes at least one of the following: text, rich text, icons; one or more instructions in the computer readable storage medium are loaded by processor 1401 and further perform the steps of:
and in the screen reading mode, sequentially reading the audio matched with each content according to the arrangement sequence of each content in the target page.
In one implementation, the target page further includes operation information, and the operation information includes at least one of the following items: the information of the operator, the operated object, the type of the operated object, the information of the operation item, the feedback presented when the operation item is selected, and the change caused by the operation information to the target page; one or more instructions in the computer readable storage medium are loaded by processor 1401 and further perform the steps of:
and in the screen reading mode, converting the operation information in the target page into operation audio for playing and outputting.
In one implementation, a target page refers to any service page in a target application; the target application program supports a screen reading mode and provides an entrance of the screen reading mode; one or more instructions in the computer-readable storage medium are loaded by processor 1401 and when executing the start screen reading mode, specifically perform the following steps:
when the entrance of the screen reading mode is triggered, starting the screen reading mode;
wherein the entry of the screen reading mode comprises any one of: keys, icons, menu items, voice passwords.
In one implementation, if the target page is a first page, the image is a native image in the first page and does not support editing; one or more instructions in the computer readable storage medium are loaded by processor 1401 and when executing the semantic description information for obtaining an image, perform the following steps:
in the process of loading the first page, semantic description information of the image is acquired;
wherein the first page comprises any one of: the system comprises a webpage, a service page of an application program, a page of an applet program and a multimedia playing page.
In one implementation, if the target page is a second page, the image is added to the second page by an editing operation; one or more instructions in the computer readable storage medium are loaded by processor 1401 and when executing the semantic description information for obtaining an image, perform the following steps:
when an image is added to the second page, semantic description information of the image is acquired;
wherein the second page includes any one of: document editing pages, online document editing pages, social session pages.
In one implementation, the semantic description information includes the description information of the target color presented by the image; one or more instructions in the computer-readable storage medium are loaded by processor 1401 and when executing the process of obtaining semantic description information of the image, perform the following steps:
identifying S colors from the image, wherein S is an integer larger than 1;
acquiring the pixel number and the saturation of each color in the S colors;
multiplying the number of pixels of each color by its saturation to obtain the color score of each color;
determining the color corresponding to the maximum color score in the S colors as the target color of the image;
descriptive information for describing the target color is generated.
In one implementation, the semantic description information includes description information of text content contained in the image; one or more instructions in the computer-readable storage medium are loaded by processor 1401 and when executing the process of obtaining semantic description information for an image, perform the following steps:
preprocessing the image;
carrying out image feature extraction on the preprocessed image to obtain image features;
classifying the image characteristics by adopting a classifier so as to identify the text content contained in the image;
and generating description information for describing the text content contained in the image.
In one implementation, the semantic description information includes description information of a source of the image and description information of an author of the image; one or more instructions in the computer-readable storage medium are loaded by processor 1401 and when executing the process of obtaining semantic description information for an image, perform the following steps:
acquiring a source of an image;
if the source of the image indicates that the image is from the local space, reading an author of the image from the local space;
if the source of the image indicates that the image is from the network file, acquiring a link of the image, and reading an author of the image according to the link;
descriptive information of the source of the image and of the author of the image is generated.
In one implementation, the semantic description information includes the description information of the object contained in the image and the description information of the behavior executed by the object contained in the image; one or more instructions in the computer-readable storage medium are loaded by processor 1401 and when executing the process of obtaining semantic description information of the image, perform the following steps:
calling a visual vocabulary model to perform object recognition processing on the image, and identifying the object contained in the image and the behavior sentence pattern associated with the object;
generating the description information of the object contained in the image, and generating, according to the object and the behavior sentence pattern associated with it, the description information of the behavior executed by the object contained in the image.
In one implementation, one or more instructions in a computer-readable storage medium are loaded by processor 1401 and when executing the process of playing hinting audio that matches semantic descriptive information of an image, perform the following steps:
creating a hidden document object node for the target page, and setting an auxiliary attribute for the hidden document object node;
processing semantic description information of the image into template texts in a templated form;
writing the template text into the hidden document object node;
when the writing operation is monitored, matching prompt audio for the written template text; and
playing the prompt audio.
In one implementation, the target page is rendered in a Canvas manner; one or more instructions in the computer readable storage medium are loaded by processor 1401 and further perform the steps of:
creating a hidden document object node for the target page, and setting an auxiliary attribute for the hidden document object node;
writing the content in the Canvas node into a hidden document object node;
when the write operation is monitored, matching content audio for the written content; and playing the content audio.
In one implementation, one or more instructions in a computer-readable storage medium are loaded by processor 1401 and further perform the steps of:
monitoring an operation event on a Canvas node and a feedback result of the operation event;
writing the operation event and the feedback result into a hidden document object node;
when the write operation is monitored, matching operation audio for the written operation event and the feedback result; and playing the operating audio.
In this embodiment, for a displayed target page that contains an image, the processor 1401 may acquire the semantic description information of the image and then play the prompt audio matched with it. In this scheme, the semantic description information describes, at the semantic level, the content expressed by the image, and is converted into matched prompt audio for broadcasting, so that the image (non-text content) in the target page can be read aloud effectively and the completeness of page-information reading is improved; in addition, because the played prompt audio expresses the content of the image accurately and completely, it assists in understanding the semantics of the image, provides richer page information during screen reading of the target page, and improves the intelligence of page screen-reading processing.
Embodiments of the present application also provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the page processing method.
Those of ordinary skill in the art would appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the embodiments of the invention are all or partially effected when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website site, computer, server, or data center to another website site, computer, server, or data center by wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The available media may be magnetic media (e.g., floppy disks, hard disks, tapes), optical media (e.g., DVDs), or semiconductor media (e.g., Solid State Disks (SSDs)), among others.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present disclosure, and all the changes or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (18)

1. A page processing method is characterized by comprising the following steps:
displaying a target page;
if the target page contains an image, semantic description information of the image is acquired;
and playing prompt audio matched with the semantic description information of the image.
2. The method of claim 1, wherein the semantic description information of the image is used to semantically describe the image;
the semantic description information includes at least one of: description information of a target color presented by the image, description information of text content contained by the image, description information of a source of the image, description information of an author of the image, description information of an object contained in the image, and description information of a behavior executed by the object contained in the image;
the prompt audio is used for prompting at least one of the following: the target color presented by the image, the text content contained in the image, the source of the image, the author of the image, the object contained in the image, and the behavior executed by the object contained in the image.
3. The method of claim 1, wherein the method further comprises: starting a screen reading mode;
the playing of the prompt audio matched with the semantic description information of the image comprises: and playing a prompt audio matched with the semantic description information of the image in the screen reading mode.
4. The method of claim 3, wherein the target page further comprises other content, the other content comprising at least one of: text, rich text, icons; the method further comprises the following steps:
and in the screen reading mode, sequentially reading the audio matched with each content according to the arrangement sequence of each content in the target page.
5. The method of claim 3, wherein the target page further contains operational information, the operational information including at least one of: information of an operator, an operated object, a type of the operated object, information of an operation item, feedback presented when the operation item is selected, and a change caused by the operation information to the target page; the method further comprises the following steps:
and converting the operation information in the target page into operation audio for playing and outputting in the screen reading mode.
6. The method of claim 3, wherein the target page is any service page in the target application; the target application program supports a screen reading mode and provides an entrance of the screen reading mode; the starting screen reading mode comprises the following steps:
when the entrance of the screen reading mode is triggered, starting the screen reading mode;
wherein the entry of the screen reading mode comprises any one of: keys, icons, menu items, voice passwords.
7. The method of claim 1, wherein if the target page is a first page, the image is a native image in the first page and does not support editing; the obtaining semantic description information of the image includes:
in the process of loading the first page, semantic description information of the image is acquired;
wherein the first page comprises any one of: the system comprises a webpage, a service page of an application program, a page of an applet program and a multimedia playing page.
8. The method of claim 1, wherein if the target page is a second page, the image is added to the second page by an editing operation; the obtaining semantic description information of the image includes:
when the image is added to the second page, semantic description information of the image is acquired;
wherein the second page includes any one of: document editing pages, online document editing pages, social session pages.
9. The method according to claim 1 or 2, wherein the semantic description information comprises description information of a target color presented by the image; the obtaining of the semantic description information of the image comprises:
identifying S colors from the image, S being an integer greater than 1;
acquiring the pixel number and the saturation of each color in the S colors;
multiplying the number of pixels of each color by its saturation respectively to obtain the color score of each color;
determining the color corresponding to the maximum color score in the S colors as the target color of the image;
generating description information for describing the target color.
10. The method according to claim 1 or 2, wherein the semantic description information includes description information of text content contained in the image; the obtaining of the semantic description information of the image includes:
preprocessing the image;
carrying out image feature extraction on the preprocessed image to obtain image features;
classifying the image features by adopting a classifier so as to identify the text content contained in the image;
and generating description information for describing the text content contained in the image.
11. The method of claim 1 or 2, wherein the semantic description information comprises description information of a source of the image and description information of an author of the image; and the obtaining semantic description information of the image comprises:
obtaining the source of the image;
if the source indicates that the image is from a local space, reading the author of the image from the local space;
if the source indicates that the image is from a network file, obtaining a link to the image and reading the author of the image through the link; and
generating the description information of the source of the image and the description information of the author of the image.
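A sketch of claim 11's branch on the image source; both helper bodies are hypothetical placeholders (a real implementation might read an EXIF Artist tag for a local file, or author metadata served alongside a network file).

```typescript
// Sketch of claim 11: local vs. network source, then author lookup.
type ImageSource =
  | { kind: "local"; path: string }
  | { kind: "network"; link: string };

async function readLocalAuthor(path: string): Promise<string> {
  // Placeholder: e.g. parse the file's EXIF data and return its Artist tag.
  return "unknown local author";
}

async function fetchAuthorFromLink(link: string): Promise<string> {
  // Placeholder: e.g. request the file's metadata from the hosting service.
  return "unknown network author";
}

async function describeSourceAndAuthor(src: ImageSource): Promise<string> {
  if (src.kind === "local") {
    return `Image from local space, author: ${await readLocalAuthor(src.path)}.`;
  }
  return `Image from network file ${src.link}, author: ${await fetchAuthorFromLink(src.link)}.`;
}
```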
12. The method of claim 1 or 2, wherein the semantic description information comprises description information of an object contained in the image and description information of a behavior performed on the object contained in the image; and the obtaining semantic description information of the image comprises:
calling a visual vocabulary model to perform object recognition on the image, obtaining an object contained in the image and a behavior sentence pattern associated with the object; and
generating description information of the object contained in the image, and generating description information of the behavior performed on the object according to the object and the behavior sentence pattern associated with the object.
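A sketch of claim 12, assuming a hypothetical VisualVocabularyModel interface; the patent does not specify the model's API, so the detection type and the `{object}` placeholder convention are illustrative.

```typescript
// Sketch of claim 12: objects plus their associated behavior sentence patterns.
interface Detection {
  object: string;          // e.g. "dog"
  behaviorPattern: string; // e.g. "The {object} is running."
}

interface VisualVocabularyModel {
  detect(image: ImageData): Detection[];
}

function describeObjects(image: ImageData, model: VisualVocabularyModel): string[] {
  return model.detect(image).map(({ object, behaviorPattern }) => {
    const objectInfo = `The image contains: ${object}.`;
    // Fill the behavior sentence pattern with the recognized object.
    const behaviorInfo = behaviorPattern.replace("{object}", object);
    return `${objectInfo} ${behaviorInfo}`;
  });
}
```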
13. The method of claim 1 or 2, wherein playing the prompt audio matched with the semantic description information of the image comprises:
creating a hidden document object node for the target page, and setting an auxiliary attribute on the hidden document object node;
processing the semantic description information of the image into templated text according to a template form;
writing the templated text into the hidden document object node;
when the write operation is detected, matching the written templated text with the prompt audio; and
playing the prompt audio.
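One plausible browser realization of claim 13: the hidden document object node becomes an off-screen div, the auxiliary attribute is read here as aria-live, and a MutationObserver stands in for monitoring the write. A real screen reader would voice the aria-live text itself; speechSynthesis is used only to make the sketch self-contained.

```typescript
// Sketch of claim 13: hidden node + auxiliary attribute + write monitoring.
function createHiddenAnnouncer(): HTMLElement {
  const node = document.createElement("div");
  node.setAttribute("aria-live", "polite"); // the auxiliary attribute
  node.style.position = "absolute";         // visually hidden, but still
  node.style.left = "-9999px";              // present in the accessibility tree
  document.body.appendChild(node);

  // "When the write operation is detected, match and play the prompt audio."
  new MutationObserver(() => {
    const text = node.textContent ?? "";
    if (text) speechSynthesis.speak(new SpeechSynthesisUtterance(text));
  }).observe(node, { childList: true, characterData: true, subtree: true });

  return node;
}

// Process the semantic description information into templated text and write it.
function announceImage(node: HTMLElement, color: string, text: string): void {
  node.textContent = `Image: mainly ${color}; contains the text "${text}".`;
}
```

Hiding the node with off-screen positioning rather than display:none matters here: an element removed from rendering is also removed from the accessibility tree and would never be announced.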
14. The method of claim 1, wherein the target page is rendered using a Canvas; the method further comprising:
creating a hidden document object node for the target page, and setting an auxiliary attribute on the hidden document object node;
writing the content of the Canvas node into the hidden document object node; and
when the write operation is detected, matching content audio for the written content, and playing the content audio.
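A sketch of claim 14's content mirroring, reusing the hidden announcer node from the claim-13 sketch above; mirrorCanvasText is a hypothetical hook, since Canvas pixels expose no DOM and the application must report what it drew.

```typescript
// Sketch of claim 14: mirror Canvas content into the hidden node, whose
// MutationObserver (claim-13 sketch) turns each write into audio.
function mirrorCanvasText(
  canvas: HTMLCanvasElement,
  hiddenNode: HTMLElement,
  drawnText: string,
): void {
  // The app calls this alongside ctx.fillText(...) so the hidden node
  // stays in sync with the visible Canvas content.
  hiddenNode.textContent = drawnText;
}
```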
15. The method of claim 14, further comprising:
monitoring operation events on the Canvas node and the feedback results of the operation events;
writing the operation events and the feedback results into the hidden document object node; and
when the write operation is detected, matching operation audio for the written operation events and feedback results, and playing the operation audio.
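A sketch of claim 15 under the same assumptions; describeFeedback is a hypothetical callback, since only the application knows what effect an operation actually had on the Canvas.

```typescript
// Sketch of claim 15: listen for operations on the Canvas, write the event
// and its feedback result into the hidden node, and let the observer voice it.
function watchCanvasOperations(
  canvas: HTMLCanvasElement,
  hiddenNode: HTMLElement,
  describeFeedback: (ev: MouseEvent) => string,
): void {
  canvas.addEventListener("click", (ev) => {
    // Writing "operation event + feedback result" triggers the
    // MutationObserver on hiddenNode (claim-13 sketch), which matches
    // and plays the operation audio.
    hiddenNode.textContent =
      `Clicked at (${ev.offsetX}, ${ev.offsetY}). ${describeFeedback(ev)}`;
  });
}
```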
16. A page processing apparatus, comprising:
a display unit configured to display a target page; and
a processing unit configured to obtain semantic description information of an image if the target page contains the image;
wherein the processing unit is further configured to play prompt audio matched with the semantic description information of the image.
17. A computer device, comprising:
a processor adapted to execute a computer program; and
a computer-readable storage medium storing a computer program which, when executed by the processor, implements the page processing method according to any one of claims 1 to 15.
18. A computer-readable storage medium storing a computer program adapted to be loaded by a processor to perform the page processing method according to any one of claims 1 to 15.
CN202110896067.8A 2021-08-05 2021-08-05 Page processing method, device, equipment and medium Pending CN113672086A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110896067.8A CN113672086A (en) 2021-08-05 2021-08-05 Page processing method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110896067.8A CN113672086A (en) 2021-08-05 2021-08-05 Page processing method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN113672086A true CN113672086A (en) 2021-11-19

Family

ID=78541564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110896067.8A Pending CN113672086A (en) 2021-08-05 2021-08-05 Page processing method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113672086A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114116112A (en) * 2021-12-08 2022-03-01 深圳依时货拉拉科技有限公司 Page processing method and device for mobile terminal and computer equipment
CN115766933A (en) * 2022-10-31 2023-03-07 中国农业银行股份有限公司 Barrier-free mode voice broadcasting method, device, equipment and storage medium


Similar Documents

Publication Publication Date Title
CN110991427B (en) Emotion recognition method and device for video and computer equipment
US9754585B2 (en) Crowdsourced, grounded language for intent modeling in conversational interfaces
CN110446063B (en) Video cover generation method and device and electronic equipment
US8819545B2 (en) Digital comic editor, method and non-transitory computer-readable medium
US11907669B2 (en) Creation of component templates based on semantically similar content
US8930814B2 (en) Digital comic editor, method and non-transitory computer-readable medium
US8952985B2 (en) Digital comic editor, method and non-transitory computer-readable medium
US20130326341A1 (en) Digital comic editor, method and non-transitorycomputer-readable medium
CN102207950A (en) Electronic apparatus, image processing method and program
CN113672086A (en) Page processing method, device, equipment and medium
KR102490319B1 (en) Methods for automatic generation and transformation of artificial intelligence content
CN109829499A (en) Image, text and data fusion sensibility classification method and device based on same feature space
CN111290688A (en) Multimedia note taking method, terminal and computer readable storage medium
CN114155529A (en) Illegal advertisement identification method combining character visual features and character content features
KR20210086836A (en) Image data processing method for searching images by text
CN111144360A (en) Multimode information identification method and device, storage medium and electronic equipment
Tymoshenko et al. Real-Time Ukrainian Text Recognition and Voicing.
Tarte Papyrological investigations: transferring perception and interpretation into the digital world
CN114095782A (en) Video processing method and device, computer equipment and storage medium
CN116127054A (en) Image processing method, apparatus, device, storage medium, and computer program
CN111914115A (en) Sound information processing method and device and electronic equipment
CN114529635A (en) Image generation method, device, storage medium and equipment
CN115062131A (en) Multi-mode-based man-machine interaction method and device
CN115168568A (en) Data content identification method and device and storage medium
CN116756306A (en) Object classification method, device, computer equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40056481
Country of ref document: HK