Publishers Unblock OpenAI’s Crawler As TikTok’s Parent ByteDance Boasts A 25x Faster Web Scraper

By Amrit Mehra

Updated on Tue, Oct 8, 2024

Amid various product upgrade announcements and several members of OpenAI royalty leaving the generative artificial intelligence (GenAI) leader’s kingdom for other pastures, the company revealed a highly successful funding round in which it raised $6.6 billion at a valuation of $157 billion.

This round witnessed interest from Thrive Capital, Khosla Ventures, Microsoft, NVIDIA, Altimeter Capital, Fidelity, SoftBank and MGX.

However, the company also warned that it would face tremendous losses this year, amounting to over $5 billion.

Yet, OpenAI is optimistic about its growth prospects and projects revenue of over $11 billion next year. This growth may be aided by another recent reveal, in which OpenAI confirmed it was ditching its not-for-profit business model and moving to a for-profit one.

Another silver lining for the company comes from reports that find publishers aren’t blocking its web crawler the way they were before.

The blocking began when publishing houses disallowed artificial intelligence (AI) web crawlers from scraping their websites for content. They did so by updating their robots.txt files (which implement the Robots Exclusion Protocol) to “disallow” AI web crawlers, including OpenAI’s GPTBot.
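To make the robots.txt mechanism concrete, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. The rules in `ROBOTS_TXT`, the `is_allowed` helper and the example.com URL are illustrative assumptions, not taken from any publisher's actual file; only the `GPTBot` token comes from the reporting above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow GPTBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story.html"  # placeholder URL
    for agent in ("GPTBot", "Bytespider"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

Note that this check only describes what the file asks for. It has an effect only when a crawler chooses to read the file and honor it.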

In fact, at the peak of this movement, over 33% of websites disallowed OpenAI’s crawler, a number that has since dropped to around 25%. When it comes to the more prominent news publishers, OpenAI has been able to pull down the block rate from 90% to 50%.

For OpenAI, this crawler gathers the data that enables its prized ChatGPT product to generate valuable answers from user prompts, since such chatbots require large amounts of data for training purposes.

In simple terms, this data serves as fodder for chatbots, and publishers and news outlets aren’t blocking OpenAI’s crawler at the rate they initially were.

This may have something to do with the recent wave of partnerships OpenAI struck up with the likes of TIME, News Corp, Reddit, Condé Nast and many other content and publishing houses.

However, not all news outlets that have unblocked OpenAI’s crawler have a deal in place or are looking for one. The Onion is one such company, and it attributes the unblocking to an oversight when it migrated to a new hosting service and content management system.

The Onion’s CEO, Ben Collins, dismissed potential deals by saying, “Obviously we are not doing any business with the Plagiarism Machine.”

Meanwhile, ByteDance, the parent company of popular social networking application TikTok, is also stepping up its interest in scraping websites to train generative AI products.

This comes with a new web-scraping tool called Bytespider, which is said to be 25 times faster than OpenAI’s GPTBot and around 3,000 times faster than Anthropic’s ClaudeBot.

Furthermore, like web crawlers used by other AI companies, ByteDance’s Bytespider doesn’t adhere to robots.txt files put up to block it.

That said, there is no legal requirement to adhere to robots.txt files, so none of these companies are actually breaking any laws by ignoring them. However, the practice does raise controversial issues around copyright infringement and data privacy.

Meanwhile, many developers have tried blocking the web scraper to no avail, as it keeps changing its IP address and alternating between randomized user agents.
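A rough sketch of why user-agent blocking fails against a scraper that randomizes its identity is shown below. The `BLOCKED_TOKENS` list, the `is_blocked` helper and both sample user-agent strings are hypothetical, made up for illustration, and do not reproduce strings actually sent by any crawler.

```python
# Hypothetical server-side filter that rejects requests by user-agent token.
BLOCKED_TOKENS = ("bytespider", "gptbot", "claudebot")


def is_blocked(user_agent: str) -> bool:
    """Return True if the user-agent string contains a blocked crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


if __name__ == "__main__":
    # Illustrative strings only: one self-identifying, one disguised.
    honest = "Mozilla/5.0 (compatible; Bytespider)"
    disguised = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    print(is_blocked(honest))     # True: the token is present
    print(is_blocked(disguised))  # False: nothing identifies the crawler
```

A filter like this catches only requests that announce themselves. Once the user agent is randomized and the IP address keeps changing, neither signal stays stable long enough to block on.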

Another cause for concern is that this revelation comes at a time when the United States and China are at odds over the sharing of technologies, from GPUs to connected car tech, as well as over the use of Chinese-owned digital platforms, a dispute that has focused on TikTok quite a bit.

This concern extends to Chinese companies engaged in the digital sector that have access to US-based data.

Do you think news outlets and publishing houses are right to unblock OpenAI’s web crawler, or do you think they should keep it out until the company’s GenAI products offer more reliable and secure services?

Let us know in the comments below!

First published on Tue, Oct 8, 2024
