New York Tech Media

Human Data Preparation for Machine Learning Is Resource-Intensive: These Two Approaches are Critical for Reducing Costs

by New York Tech Editorial Team
March 7, 2022
in AI & Robotics

By: Dattaraj Rao, Chief Data Scientist, Persistent Systems

As with any system that depends on data inputs, Machine Learning (ML) is subject to the axiom of “garbage in, garbage out.” Clean, accurately labeled data is the foundation for building any ML model. An ML training algorithm learns patterns from the ground-truth data and, from there, learns to generalize to unseen data. If the quality of your training data is low, it will be very difficult for the algorithm to learn those patterns and generalize from them.

Think about it in terms of training a pet dog. If you fail to teach the dog fundamental behavioral commands (the inputs), or teach them incorrectly, you cannot expect the dog to build on them through observation and learn more complex positive behaviors, because the underlying inputs were absent or flawed to begin with. Proper training is time-intensive, and even costly if you bring in an expert, but the payoff is great if you do it right from the start.

When training an ML model, creating quality data requires a domain expert to spend time annotating the data. This may include drawing a window around the desired object in an image, or assigning a label to a text entry or a database record. Particularly for unstructured data like images, videos, and text, annotation quality plays a major role in determining model quality. Unlabeled data such as raw images and text is usually abundant; labeling is where effort needs to be optimized. This is the human-in-the-loop part of the ML lifecycle, and it is usually the most expensive and labor-intensive part of any ML project.

Data annotation and preparation tools such as Prodigy, Amazon SageMaker Ground Truth, NVIDIA RAPIDS, and DataRobot's human-in-the-loop features are constantly improving in quality and providing intuitive interfaces for domain experts. However, minimizing the time domain experts need to annotate data is still a significant challenge for enterprises today, especially in an environment where data science talent is limited yet in high demand. This is where two approaches to data preparation come into play.

Active Learning

Active learning is a method where an ML model actively queries a domain expert for specific annotations. Here, the focus is not on annotating all of the unlabeled data, but on getting the right data points annotated so that the model can learn better. Take, for example, a healthcare and life sciences diagnostic company that specializes in early cancer detection to help clinicians make informed, data-driven decisions about patient care. As part of its diagnostic process, it needs to annotate CT scan images, highlighting any tumors.

With active learning, after the ML model learns from a few images with tumor regions marked, it will only ask users to annotate images where it is unsure whether a tumor is present. These are boundary points which, once annotated, increase the confidence of the model. Where the model's confidence is above a particular threshold, it annotates the image itself rather than asking the user. This is how active learning helps build accurate models while reducing the time and effort required to annotate data. Frameworks like modAL can help increase classification performance by intelligently querying domain experts to label the most informative instances.
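The query loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how modAL or any diagnostic product works: the one-dimensional data, the tiny centroid "model", the `expert` function standing in for the domain expert, and the `sharpness` and `confidence` values are all hypothetical.

```python
import math

def fit_centroids(labeled):
    """Return the mean feature value of each class (a tiny stand-in model).

    Assumes the labeled set contains at least one example of each class.
    """
    means = {}
    for label in (0, 1):
        values = [x for x, y in labeled if y == label]
        means[label] = sum(values) / len(values)
    return means

def prob_positive(means, x, sharpness=10.0):
    """Squash the signed distance from the class midpoint into a probability."""
    midpoint = (means[0] + means[1]) / 2
    direction = 1.0 if means[1] >= means[0] else -1.0
    return 1 / (1 + math.exp(-sharpness * direction * (x - midpoint)))

def active_learning(pool, seed, oracle, budget, confidence=0.95):
    """Query the expert on the least certain points, then self-label the rest."""
    labeled = list(seed)
    unlabeled = list(pool)
    queried = []
    for _ in range(budget):
        if not unlabeled:
            break
        means = fit_centroids(labeled)
        # Uncertainty sampling: pick the point whose probability is closest to 0.5.
        x = min(unlabeled, key=lambda v: abs(prob_positive(means, v) - 0.5))
        unlabeled.remove(x)
        labeled.append((x, oracle(x)))  # the human-in-the-loop step
        queried.append(x)
    means = fit_centroids(labeled)
    auto_labeled, still_unsure = [], []
    for x in unlabeled:
        p = prob_positive(means, x)
        if max(p, 1 - p) >= confidence:
            auto_labeled.append((x, int(p >= 0.5)))  # confident: self-annotate
        else:
            still_unsure.append(x)  # left for a later round of queries
    return queried, auto_labeled, still_unsure

def expert(x):
    """Hypothetical ground truth standing in for the domain expert."""
    return int(x > 0.5)

pool = [i / 20 for i in range(1, 20)]  # unlabeled points 0.05 .. 0.95
seed = [(0.0, 0), (1.0, 1)]            # two examples labeled up front
queried, auto_labeled, still_unsure = active_learning(pool, seed, expert, budget=4)
print("asked expert about:", queried)
print("self-annotated:", auto_labeled)
```

In this toy run the expert is asked about only four of the nineteen points, all near the decision boundary, while the points far from it are labeled automatically. A real project would swap the centroid model for an actual classifier and revisit the points that remain uncertain.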

Weak Supervision

Weak supervision is an approach where noisy, imprecise signals or abstract concepts are used to provide indications for labeling a large amount of unlabeled data. This approach usually relies on several weak labelers and combines them, ensemble-style, to build quality annotated data. The aim is to incorporate domain knowledge into an automated labeling activity.

For example, if an Internet Service Provider (ISP) needed a system to flag emails as spam or not spam, we could write weak rules such as checking for words like “offer”, “congratulations”, and “free”, which are mostly associated with spam emails. Other rules could flag emails whose source addresses match specific patterns that can be searched with regular expressions. These weak functions could then be combined by a weak supervision framework like Snorkel or skweak to build better-quality training data.
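As a rough sketch of the spam example, the weak rules can be written as small labeling functions whose votes are then combined. Note the assumptions: the sender pattern, the trusted domain, and the sample emails are made up for illustration, and the votes are combined by simple majority as a stand-in for the more sophisticated combination that frameworks such as Snorkel perform.

```python
import re

SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_spam_phrases(email):
    """Vote SPAM when the body contains one of the words named in the text."""
    text = email["body"].lower()
    hit = any(word in text for word in ("offer", "congratulations", "free"))
    return SPAM if hit else ABSTAIN

def lf_suspicious_sender(email):
    """Vote SPAM for a made-up sender pattern: three or more digits before '@'."""
    return SPAM if re.search(r"\d{3,}@", email["sender"]) else ABSTAIN

def lf_known_domain(email):
    """Vote HAM for senders from a hypothetical trusted domain."""
    return HAM if email["sender"].endswith("@example.com") else ABSTAIN

LABELING_FUNCTIONS = [lf_spam_phrases, lf_suspicious_sender, lf_known_domain]

def weak_label(email):
    """Combine the non-abstaining votes by simple majority; ties abstain."""
    votes = [lf(email) for lf in LABELING_FUNCTIONS]
    spam_votes = votes.count(SPAM)
    ham_votes = votes.count(HAM)
    if spam_votes == ham_votes:
        return ABSTAIN
    return SPAM if spam_votes > ham_votes else HAM

emails = [
    {"sender": "promo12345@mail.test", "body": "Congratulations, claim your free offer"},
    {"sender": "alice@example.com", "body": "Meeting notes attached"},
    {"sender": "bob@example.com", "body": "Free for lunch tomorrow?"},
    {"sender": "carol@mail.test", "body": "See you at the station"},
]
labels = [weak_label(e) for e in emails]
print(labels)  # [1, 0, -1, -1]
```

The third email shows why a single rule is weak: the word “free” alone would mark a harmless message as spam, but the conflicting vote from the sender rule makes the combined labeler abstain instead. Emails that end up with no label can be left out of the training set or routed to a domain expert.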

ML at its core is about helping companies scale processes exponentially in ways that are physically impossible to achieve manually. However, ML is not magic and still relies on humans to a) set up and train the models properly from the start and b) intervene when needed so that the model does not become so skewed that its results are no longer useful, or even counterproductive.

The goal is to find ways to streamline and automate parts of the human involvement to shorten time-to-market and improve results, while staying within the guardrails of acceptable accuracy. It is widely accepted that getting quality annotated data is the most expensive, yet an extremely important, part of an ML project. This is an evolving space, and a lot of effort is underway to reduce the time spent by domain experts and improve the quality of data annotations. Exploring and leveraging active learning and weak supervision is a solid strategy to achieve this across multiple industries and use cases.

© 2024 All Rights Reserved - New York Tech Media
