Created by: Mistress-Anna
What is this Python project?
WebOOB is a framework for scraping websites and aggregating data from multiple websites.
What's the difference between this Python project and similar ones?
- Routing model that maps URL patterns to Page classes, with all the parsing for a page kept in its own class, for cleaner code
- Scraping is made easy thanks to "declarative parsing": each Page declares a few XPaths and a few "filters" to apply to their results (parse an int, apply a regex, etc.), and you're set!
- Like every high-level feature in WebOOB, declarative parsing can be disabled locally when it doesn't fit a particular site, and it's always possible to fall back to plain old procedural parsing code
- Pagination handling, with support for infinite iterators
- Typed data models to ensure clean scraped data
- Can handle HTML/XML, JSON, and even XLS or PDF
- (Optional) Can aggregate data from multiple websites by grouping them in categories (for example "video sites", "banking sites", "public transport sites", "event sites", etc.)
- Comes with ~250 built-in website crawling backends
- Has a few graphical and command-line apps to explore and search the scraped data
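
To give a rough idea of what "URL routing to Page classes" and "declarative parsing" mean, here is a minimal sketch using only the Python standard library. It is not WebOOB's actual API: the names Field, regexp, Page, VideoPage, SearchPage, ROUTES and route, as well as the URLs and markup, are all made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


class Field:
    """Hypothetical declarative field: an XPath plus a chain of filters."""

    def __init__(self, xpath, *filters):
        self.xpath = xpath
        self.filters = filters

    def extract(self, root):
        node = root.find(self.xpath)
        value = "".join(node.itertext()).strip() if node is not None else ""
        for apply_filter in self.filters:
            value = apply_filter(value)
        return value


def regexp(pattern):
    """Filter factory: keep the first capture group of `pattern`."""
    def apply(text):
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"{pattern!r} did not match {text!r}")
        return match.group(1)
    return apply


class Page:
    """Base page: parses every declared field of the subclass."""
    fields = {}

    def __init__(self, markup):
        self.root = ET.fromstring(markup)

    def parse(self):
        return {name: f.extract(self.root) for name, f in self.fields.items()}


class VideoPage(Page):
    # Declarative style: XPaths and filters only, no parsing logic.
    fields = {
        "title": Field(".//h1"),
        "views": Field(".//span[@class='views']", regexp(r"(\d+)"), int),
    }


class SearchPage(Page):
    # Declarative parsing skipped here: plain procedural code instead.
    def parse(self):
        return {"links": [a.get("href") for a in self.root.iter("a")]}


# Routing table: URL pattern -> Page class that knows how to parse it.
ROUTES = [
    (re.compile(r"/video/\d+$"), VideoPage),
    (re.compile(r"/search$"), SearchPage),
]


def route(path, markup):
    """Return the Page instance whose URL pattern matches `path`."""
    for pattern, page_class in ROUTES:
        if pattern.search(path):
            return page_class(markup)
    raise LookupError(f"no page registered for {path!r}")
```

Calling route("/video/42", markup).parse() on a small well-formed document returns a dict with a cleaned title and an integer view count, while the same call on "/search" goes through the hand-written parse() of SearchPage. The real framework is much richer than this (typed models, pagination, non-HTML documents), so treat the sketch as an illustration of the idea only.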