How to Use rs-trafilatura with Scrapy

Scrapy is the standard Python framework for web scraping. It handles crawling, scheduling, and data pipelines. rs-trafilatura plugs into Scrapy as an item pipeline — your spider yields items with H...

By · · 1 min read
How to Use rs-trafilatura with Scrapy

Source: DEV Community

Scrapy is the standard Python framework for web scraping. It handles crawling, scheduling, and data pipelines. rs-trafilatura plugs into Scrapy as an item pipeline — your spider yields items with HTML, and the pipeline adds structured extraction results automatically. Install pip install rs-trafilatura scrapy Setup Add the pipeline to your Scrapy project's settings.py: ITEM_PIPELINES = { "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300, } That's it. Every item that passes through the pipeline with a body (bytes) or html (string) field will get an extraction dict added to it. Writing the Spider Your spider yields items with the response body and URL: import scrapy class ContentSpider(scrapy.Spider): name = "content" start_urls = ["https://example.com"] def parse(self, response): yield { "url": response.url, "body": response.body, # raw bytes — rs-trafilatura auto-detects encoding } # Follow links for href in response.css("a::attr(href)").getall(): yield response.follow(href, self.pars

Related Posts

Trending on ShareHub

  1. Understanding Modern JavaScript Frameworks in 2026
    by Alex Chen · Feb 12, 2026 · 0 likes
  2. The System Design Primer
    by Sarah Kim · Feb 12, 2026 · 0 likes
  3. Just shipped my first open-source project!
    by Alex Chen · Feb 12, 2026 · 0 likes
  4. OpenAI Blog
    by Sarah Kim · Feb 12, 2026 · 0 likes
  5. Building Accessible Web Applications: A Practical Guide
    by Alex Chen · Feb 12, 2026 · 0 likes
  6. Rapper Lil Poppa dead at 25, days after releasing new music
    Rapper Lil Poppa dead at 25, days after releasing new music
    by Anonymous User · Feb 19, 2026 · 0 likes
  7. write-for-us
    by Volt Raven · Mar 7, 2026 · 0 likes
  8. Before the Coffee Gets Cold: Heartfelt Story of Time Travel and Second Chances
    Before the Coffee Gets Cold: Heartfelt Story of Time Travel and Second Chances
    by Anonymous User · Feb 12, 2026 · 0 likes
    #coffee gets cold #the #time travel
  9. Best DoorDash Promo Code Reddit Finds for Top Discounts
    Best DoorDash Promo Code Reddit Finds for Top Discounts
    by Anonymous User · Feb 12, 2026 · 0 likes
    #doordash #promo #reddit
  10. Premium SEO Services That Boost Rankings & Revenue | VirtualSEO.Expert
    by Anonymous User · Feb 12, 2026 · 0 likes
  11. NBC under fire for commentary about Team USA women's hockey team
    NBC under fire for commentary about Team USA women's hockey team
    by Anonymous User · Feb 18, 2026 · 0 likes
  12. Where to Watch The Nanny: Streaming and Online Viewing Options
    Where to Watch The Nanny: Streaming and Online Viewing Options
    by Anonymous User · Feb 12, 2026 · 0 likes
    #streaming #the nanny #where
  13. How Much Is Kindle Unlimited? Subscription Cost and Plan Details
    How Much Is Kindle Unlimited? Subscription Cost and Plan Details
    by Anonymous User · Feb 12, 2026 · 0 likes
    #kindle unlimited #subscription #unlimited
  14. Russian skater facing backlash for comment about Amber Glenn
    Russian skater facing backlash for comment about Amber Glenn
    by Anonymous User · Feb 18, 2026 · 0 likes
  15. Google News
    Google News
    by Anonymous User · Feb 18, 2026 · 0 likes

Latest on ShareHub

Browse Topics

#ai (3867)#news (2324)#webdev (1642)#programming (1167)#business (1142)#opensource (975)#security (898)#productivity (852)#/business (825)#javascript (713)

Around the Network