New York Tech Media
AI-Based Generative Writing Models Frequently ‘Copy and Paste’ Source Data

New York Tech Editorial Team by New York Tech Editorial Team
November 19, 2021
in AI & Robotics

American playwright and entrepreneur Wilson Mizner is famously quoted as saying, ‘When you steal from one author, it’s plagiarism; if you steal from many, it’s research.’

Similarly, the assumption around the new generation of AI-based creative writing systems is that the vast amounts of data fed to them at the training stage have resulted in a genuine abstraction of high-level concepts and ideas; that these systems have at their disposal the distilled wisdom of thousands of contributing authors, from which the AI can formulate innovative and original writing; and that those who use such systems can be certain that they’re not inadvertently indulging in plagiarism-by-proxy.

It’s a presumption that’s challenged by a new paper from a research consortium (including Facebook and Microsoft’s AI research divisions), which has found that machine learning generative language models such as the GPT series ‘occasionally copy even very long passages’ into their supposedly original output, without attribution.

In some cases, the authors note, GPT-2 will duplicate over 1,000 words from the training set in its output.

The paper is titled How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN, and is a collaboration between Johns Hopkins University, Microsoft Research, New York University and Facebook AI Research.

RAVEN

The study uses a new approach called RAVEN (RAting VErbal Novelty), an acronym that has been entertainingly tortured to reflect the avian villain of a classic poem:

‘This acronym refers to “The Raven” by Edgar Allan Poe, in which the narrator encounters a mysterious raven which repeatedly cries out, “Nevermore!” The narrator cannot tell if the raven is simply repeating something that it heard a human say, or if it is constructing its own utterances (perhaps by combining never and more)—the same basic ambiguity that our paper addresses.’

The findings from the new paper come in the context of major growth for AI content-writing systems that seek to automate ‘simple’ editing tasks, and even to write full-length content. One such system received $21 million in Series A funding earlier this week.

The researchers note that ‘GPT-2 sometimes duplicates training passages that are over 1,000 words long’ (their emphasis), and that generative language systems propagate linguistic errors present in the source data.

The language models studied under RAVEN were the GPT series of releases up to GPT-2 (the authors did not have access at that time to GPT-3), a Transformer, Transformer-XL, and an LSTM.
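RAVEN’s central question, whether a stretch of generated text also occurs verbatim in the training data, can be illustrated with a minimal n-gram check. The sketch below is an illustration of the general technique, not the paper’s actual implementation:

```python
def ngrams(tokens, n):
    """Return the list of successive n-grams from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novelty_fraction(training_text, generated_text, n=3):
    """Fraction of generated n-grams that never appear in the training text."""
    train_grams = set(ngrams(training_text.split(), n))
    gen_grams = ngrams(generated_text.split(), n)
    if not gen_grams:
        return 0.0
    novel = sum(1 for g in gen_grams if g not in train_grams)
    return novel / len(gen_grams)

# Toy example: only the final trigram ('on', 'the', 'sofa') is novel.
print(novelty_fraction("the cat sat on the mat",
                       "the cat sat on the sofa", n=3))  # → 0.25
```

A fraction near zero at large n flags the kind of long verbatim duplication the paper reports; a fraction near one indicates genuinely recombined output.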

Novelty

The paper notes that GPT-2 coins Bush 2-style inflections such as ‘Swissified’, and derivations such as ‘IKEA-ness’, creating such novel words (they do not appear in GPT-2’s training data) on linguistic principles derived from higher dimensional spaces established during training.

The results also show that ‘74% of sentences generated by Transformer-XL have a syntactic structure that no training sentence has’, indicating, as the authors state, ‘neural language models do not simply memorize; instead they use productive processes that allow them to combine familiar parts in novel ways.’

In principle, then, this generalization and abstraction should produce genuinely novel text.

Data Duplication May Be the Problem

The paper theorizes that long and verbatim citations produced by Natural Language Generation (NLG) systems could become ‘baked’ whole into the AI model because the original source text is repeated multiple times in datasets that have not been adequately de-duplicated.

Though another research project has found that complete duplication of text can occur even when the source text appears only once in the dataset, the authors note that that project uses a different conceptual architecture from the common run of content-generating AI systems.

The authors also observe that changing the decoding component in language generation systems could increase novelty, but found in tests that this occurs at the expense of quality of output.

Further problems emerge as the datasets that fuel content-generating algorithms get ever larger. Besides aggravating issues around the affordability and viability of data pre-processing, as well as quality assurance and de-duplication of the data, many basic errors remain in source data, which then become propagated in the content output by the AI.

The authors note*:

‘Recent increases in training set sizes make it especially critical to check for novelty because the magnitude of these training sets can break our intuitions about what can be expected to occur naturally. For instance, some notable work in language acquisition relies on the assumption that regular past tense forms of irregular verbs (e.g., becomed, teached) do not appear in a learner’s experience, so if a learner produces such words, they must be novel to the learner.

‘However, it turns out that, for all 92 basic irregular verbs in English, the incorrect regular form appears in GPT-2’s training set.’
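The check the authors describe amounts to a vocabulary lookup: regularize each irregular verb and search the corpus for the resulting form. The corpus and verb list below are hypothetical stand-ins for illustration; the paper ran this check against all 92 basic English irregular verbs in GPT-2’s actual training set:

```python
def regularize(stem):
    """Naive regular past tense: 'teach' -> 'teached', 'become' -> 'becomed'."""
    return stem + ("d" if stem.endswith("e") else "ed")

def overregularized_forms_present(corpus_tokens, irregular_stems):
    """Return incorrect regular past forms of irregular verbs found in a corpus."""
    vocab = set(corpus_tokens)
    return {regularize(s) for s in irregular_stems if regularize(s) in vocab}

# Hypothetical mini-corpus containing two over-regularized forms.
corpus = "he becomed angry and she teached him a lesson".split()
print(sorted(overregularized_forms_present(corpus, ["become", "teach", "go"])))
# → ['becomed', 'teached']
```

If even one such form appears, the language-acquisition assumption that learners never encounter these words no longer holds for the model.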

More Data Curation Needed

The paper contends that more attention needs to be paid to novelty in the formulation of generative language systems, with a particular emphasis on ensuring that the ‘withheld’ test portion of the data (the part of the source data that is set aside for testing how well the final algorithm has assessed the main body of trained data) is apposite for the task.

‘In machine learning, it is critical to evaluate models on a withheld test set. Due to the open-ended nature of text generation, a model’s generated text might be copied from the training set, in which case it is not withheld—so using that data to evaluate the model (e.g., for coherence or grammaticality) is not valid.’

The authors also contend that more care is needed in the production of language models because of the Eliza effect, a phenomenon identified in 1966 that describes “the susceptibility of people to read far more understanding than is warranted into strings of symbols—especially words—strung together by computers”.

 

* My conversion of inline citations to hyperlinks

 

