New York Tech Media

Beyond ‘Reader Mode’ With Machine Learning

by New York Tech Editorial Team
November 3, 2021
in AI & Robotics

Researchers from South Korea have used machine learning to develop an improved method for extracting actual content from web pages so that the ‘furniture’ of a web page – such as sidebars, footers and navigation headers, as well as advertisement blocks – disappears for the reader.

Though such functionality is either built into most popular web browsers, or else is easily available via extensions and plugins, these technologies rely on semantic formatting that may not be present in the web page, or which may have been deliberately compromised by the site owner in order to prevent the reader hiding the ‘full fat’ experience of the page.

One of our own web pages ‘slimmed down’ with Firefox’s integral Reader View functionality.

Instead, the new method uses a grid-based system that iterates through the web page, evaluating how pertinent the content is to the core aim of the page.

The content extraction pipeline first divides the page into a grid (upper row) before evaluating the relationship of found pertinent cells to other cells (middle) and finally merging the approved cells (bottom). Source: https://arxiv.org/ftp/arxiv/papers/2110/2110.14164.pdf

Once a pertinent cell is identified, its relationship with nearby cells is also evaluated, and the cells that pass are merged into the interpreted ‘core content’.
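The grid, evaluate and merge flow described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the per-cell scores, the `threshold` parameter, the choice of the highest-scoring cell as the starting point, and the function names `grid_center_expand` and `bounding_box` are all assumptions made for the example, since the article does not say how cells are scored or compared.

```python
from collections import deque


def grid_center_expand(scores, threshold=0.5):
    """Return the set of (row, col) grid cells treated as main content.

    `scores` is a 2D list of hypothetical per-cell relevance scores in [0, 1].
    """
    rows, cols = len(scores), len(scores[0])

    # Assumed starting point: the single highest-scoring cell on the page.
    center = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: scores[rc[0]][rc[1]],
    )
    if scores[center[0]][center[1]] < threshold:
        return set()

    # Expand outwards, accepting 4-connected neighbours that also pass.
    selected = {center}
    queue = deque([center])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in selected and scores[nr][nc] >= threshold:
                selected.add((nr, nc))
                queue.append((nr, nc))
    return selected


def bounding_box(cells):
    """Merge the selected cells into one (top, left, bottom, right) box."""
    rs = [r for r, _ in cells]
    cs = [c for _, c in cells]
    return (min(rs), min(cs), max(rs), max(cs))


# Toy 4x4 page: an article body in the middle and a high-scoring ad top right.
page = [
    [0.10, 0.20, 0.10, 0.70],
    [0.10, 0.80, 0.95, 0.20],
    [0.10, 0.90, 0.70, 0.10],
    [0.20, 0.10, 0.10, 0.10],
]
content = grid_center_expand(page)
print(sorted(content))        # [(1, 1), (1, 2), (2, 1), (2, 2)]
print(bounding_box(content))  # (1, 1, 2, 2)
```

In this toy grid the top-right cell passes the threshold on its own, but it is never reached because it is not adjacent to any accepted cell, which is the kind of isolated block a neighbour check is meant to exclude.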

The central idea of the approach is to abandon code-based markup as an index of relevance, and to deduce the content based solely on its visual appearance. HTML tags that would normally denote the beginning of a paragraph, for instance, can be replaced by alternate tags that will ‘fool’ screen readers and utilities such as Reader View.
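To see why markup is a fragile signal, consider a deliberately naive tag-based extractor. The sketch below is a toy built on Python's standard `html.parser` module, not a description of how Reader View or any real tool works; the class `ParagraphOnlyExtractor`, the helper `extract_paragraphs` and both sample pages are invented for illustration.

```python
from html.parser import HTMLParser


class ParagraphOnlyExtractor(HTMLParser):
    """Toy extractor that keeps only text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # how many <p> elements are currently open
        self.chunks = []  # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._depth > 0 and text:
            self.chunks.append(text)


def extract_paragraphs(html):
    parser = ParagraphOnlyExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# The same story, once with semantic markup and once with the <p> swapped out.
semantic_page = "<nav>Home | News</nav><p>The main story text.</p>"
obfuscated_page = '<nav>Home | News</nav><div class="x1">The main story text.</div>'

print(extract_paragraphs(semantic_page))    # ['The main story text.']
print(extract_paragraphs(obfuscated_page))  # []
```

Both pages would look identical to a human reader, yet the tag-based rule finds nothing in the second one. An approach that judges content by its rendered appearance is not affected by this kind of substitution.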

The approach, called Grid-Center-Expand (GCE), has been extended by the researchers into Deep Neural Network (DNN) models that exploit Google’s TabNet, an interpretable tabular learning architecture.

Get To the Point

The paper is titled Don’t read, just look: Main content extraction from web pages using visually apparent features, and comes from three researchers at Hanyang University, and one from the Institute of Convergence Technology, all located in Seoul.

Improved extraction of core web page content is potentially valuable not only for the casual end-user, but also for machine systems that are tasked with ingesting or indexing domain content for the purposes of Natural Language Processing (NLP), and other sectors in AI.

As it stands, if non-relevant content is included in such extraction processes, it may need to be manually filtered (or labeled), at great expense; worse, if the unwanted content is included with the core content, it could affect how the core content is interpreted, and the outcome of transformer and encoder/decoder systems that are relying on clean content.

An improved method, the researchers argue, is especially necessary because existing approaches often fail with non-English web pages.

French, Japanese and Russian web pages are noted as scoring worst in success rates for the four most common ‘Reader View’ approaches: Mozilla’s Readability.js; Google’s DOM Distiller; Web2Text; and Boilernet.

Datasets and Training

The researchers compiled dataset material from English keywords in the GoogleTrends-2017 and GoogleTrends-2020 datasets, though they observe that, in terms of results, there were no practical differences between the two.

Additionally, the authors gathered non-English keywords from South Korea, France, Japan, Russia, Indonesia and Saudi Arabia. Chinese keywords were added from a Baidu dataset, since Google Trends could not offer Chinese data.

Testing and Results

In testing the system, the authors found that it offers the same level of performance as recent DNN models, while providing better accommodation for a wider variety of languages.

For instance, the Boilernet architecture maintains good performance in extracting pertinent content but adapts poorly to Chinese and Japanese datasets, while Web2Text, the authors find, has ‘relatively poor performance’ all round, relying on linguistic features that are not multilingual and are unsuited to extracting central content from web pages.

Mozilla’s Readability.js was found to achieve acceptable performance across multiple languages including English, even as a rule-based method. However, the researchers found that its performance dropped notably on Japanese and French datasets, highlighting the limitations of trying to parse characteristics of a specific region entirely by rule-based approaches.

Meanwhile Google’s DOM Distiller, which blends heuristics and machine learning approaches, was found to perform well across the board.

Table of results for methods tested during the project, including the researchers’ own GCE module. Higher numbers are better.
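The article does not say which metric lies behind these scores. One common way to score content extraction is token-level precision, recall and F1 against a hand-labeled reference, and the sketch below assumes that metric purely for illustration. The function name `token_prf` and the sample strings are hypothetical.

```python
from collections import Counter


def token_prf(extracted, reference):
    """Token-level precision, recall and F1 of extracted text vs. a reference.

    Tokens are split on whitespace, which is an assumption that does not hold
    for languages such as Chinese or Japanese; those need a proper tokenizer.
    """
    ext = Counter(extracted.split())
    ref = Counter(reference.split())
    overlap = sum((ext & ref).values())  # tokens shared by both, with counts

    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# The extractor kept the whole story but also let two menu words through.
reference = "the main story text"
extracted = "home news the main story text"
precision, recall, f1 = token_prf(extracted, reference)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 1.0 0.8
```

Under this assumed metric, leftover page furniture lowers precision, while missing story text lowers recall, so a higher F1 means the extractor did better on both counts.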

The researchers conclude that ‘GCE does not need to keep up with the rapidly changing web environment because it relies on human nature—genuinely global and multilingual features’.

 


© 2024 All Rights Reserved - New York Tech Media
