A web scraping CLI made for AI that is idempotent

github.com

81 points by clemlesne a year ago · 33 comments

renegat0x0 a year ago

I have a similar project. I scrape pages only to obtain page metadata. It can use Selenium or Crawlee; I will also add Puppeteer later.

The project is quite big and has many features.

It is my internet command center. I use it to check what's new on the internet.

https://github.com/rumca-js/Django-link-archive

usernamed7 a year ago

Not to discount any actual utility or innovation here, but I was wondering "why would you hard-code to all these Azure services?", then I saw the author is a solutions architect at Microsoft.

So this is likely part of Microsoft's AI strategy to lure developers in and create dependence. That doesn't mean it can't also be interesting or good, but it's important context for this project's purpose and goals.

  • clemlesneOP a year ago

    I’m developing this in my free time because I think there is a need for it inside the community. I’m not motivated in any way by my company.

    In the meantime, if you have other technologies achieving the features (blob, queue, search), feel free to push a PR. Someone already did that for AWS: https://github.com/clemlesne/scrape-it-now/issues/8.

cha-d a year ago

Am I right in thinking that running this regularly from a home computer will cause you to receive more CAPTCHAs over time? If so, what are some other options?

wcallahan a year ago

Nice work. Would love a similar repository for Google cloud’s equivalent services!

Or a PR on this that accomplishes the same, as @clemlesne mentioned.

katella a year ago

Does it just scrape all pages of a site?

  • clemlesneOP a year ago

    It saves the page content, transforms it to Markdown, and optionally imports it into a search database to perform semantic (sentence-level) searches.
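The scrape → Markdown → search-index pipeline described above can be sketched in a few lines of standard-library Python. Everything here is illustrative: `html_to_markdown` and `Index` are hypothetical names, and the toy substring search stands in for a real semantic database like Azure AI Search.

```python
# Hedged sketch of the pipeline (not the project's actual code):
# 1) convert scraped HTML to rough Markdown, 2) import it into an index.
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Very rough HTML-to-Markdown: keeps headings and paragraph text."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._heading = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            # Remember the heading level so the next text node gets a prefix.
            self._heading = "#" * int(tag[1]) + " "

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._heading:
            self.parts.append(self._heading + text)
            self._heading = None
        else:
            self.parts.append(text)


def html_to_markdown(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    return "\n\n".join(parser.parts)


class Index:
    """Toy keyword index standing in for a semantic search database."""

    def __init__(self):
        self.docs = {}

    def add(self, url: str, markdown: str) -> None:
        self.docs[url] = markdown

    def search(self, query: str) -> list:
        q = query.lower()
        return [u for u, d in self.docs.items() if q in d.lower()]


if __name__ == "__main__":
    idx = Index()
    idx.add("https://example.com", html_to_markdown(
        "<h1>Title</h1><p>Scraping pipelines are fun.</p>"))
    print(idx.search("pipelines"))  # ['https://example.com']
```

A production version would replace `_TextExtractor` with a real converter and `Index` with embedding-based search; the shape of the flow is the same.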

mrdw a year ago

why so dependent on azure?

"Decoupled architecture with Azure Queue Storage"

"Scraped content is stored in Azure Blob Storage"

"Indexed content is semantically searchable with Azure AI Search"

  • clemlesneOP a year ago

    Well, it solves basic problems like queuing and blob storage. For example, to achieve the same features as Queue Storage, you would need RabbitMQ or similar: in an enterprise environment, that means multiple instances in high availability, maintenance, and people to deploy it reproducibly…
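The pluggable-backend idea behind swapping Azure Queue Storage for RabbitMQ (or the AWS contribution mentioned earlier) can be sketched as a small interface. The names below (`QueueBackend`, `InMemoryQueue`, `drain`) are hypothetical, not the project's actual API.

```python
# Hedged sketch of a storage-agnostic queue abstraction, so Azure Queue
# Storage, RabbitMQ, or an in-memory stub are interchangeable backends.
from abc import ABC, abstractmethod
from collections import deque
from typing import Optional


class QueueBackend(ABC):
    """Minimal contract any queue provider must satisfy."""

    @abstractmethod
    def push(self, message: str) -> None: ...

    @abstractmethod
    def pop(self) -> Optional[str]: ...


class InMemoryQueue(QueueBackend):
    """Stand-in for Azure Queue Storage during local development/tests."""

    def __init__(self):
        self._q = deque()

    def push(self, message: str) -> None:
        self._q.append(message)

    def pop(self) -> Optional[str]:
        return self._q.popleft() if self._q else None


def drain(queue: QueueBackend):
    """Worker loop: yield queued URLs until the queue is empty."""
    while (url := queue.pop()) is not None:
        yield url


if __name__ == "__main__":
    q = InMemoryQueue()
    q.push("https://example.com/a")
    q.push("https://example.com/b")
    print(list(drain(q)))  # ['https://example.com/a', 'https://example.com/b']
```

A real Azure or AWS backend would implement the same two methods over its SDK, which is why a single PR can add a new provider without touching the worker loop.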

bbor a year ago

lol I love the cheeky `[ ] respect robots.txt` mention. I was all worried about this for my own system, but shocked to find out there’s a ton of projects openly built around breaking the law (/social protocol). Is the justification just the same as pirating entertainment, ie “big companies are bad” and/or “IP is unjustified”?

This one in particular doesn’t fit my exact use case I don’t think, but I love the repo, very clearly explained. Well done! I hadn’t even thought about ads until just now, that’s an interesting problem…
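For reference, honoring robots.txt takes only the standard library. This is a minimal, generic sketch using `urllib.robotparser` (not taken from this project's code); the inlined robots.txt and the `my-scraper` user agent are illustrative.

```python
# Hedged sketch: checking a URL against robots.txt with the stdlib.
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt; a real crawler would fetch /robots.txt per host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, user_agent: str = "my-scraper") -> bool:
    """Return True if the parsed robots.txt permits fetching this URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("https://example.com/public/page"))   # True
    print(allowed("https://example.com/private/page"))  # False
```

A crawler would call `allowed()` before each fetch and skip (or log) disallowed URLs, which is presumably what the project's unchecked roadmap item would add.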

  • deisteve a year ago

    I just don't get why people use web scraping as a battleground for moral ethics.

    It's bizarre, just like equating copyright infringement to theft of property.

    Where does this moral high ground come from? Nobody scraping is thinking "oh, I'm so evil, I'm scraping without respecting robots.txt and using residential IP addresses to bypass detection".

    Google does it and nobody has a problem, but when the little guy does it, suddenly they're an outlaw.

    • trog a year ago

      > Google does it nobody has a problem

      Historically, when Google did it, they did it to create an index, which a lot of people found useful as a way to find information they were looking for. This used to mean people would come and visit your website, where they could engage with the website creator directly through a variety of different means.

      Google doing it now to digest all the content and mulch it all together to return a regurgitated form of it is a very different proposition, and that is what people are annoyed about when "the little guys" (funny name for startups with multiple billions of dollars of raised capital) are doing the same thing.

      For many it's not about "moral ethics", it's about actual survival. If nobody is visiting their website, nobody is buying their products or engaging with their community or whatever.

      If you're scraping content for no other purpose than to mechanistically reword it for commercial purposes, then it's not really surprising that people have issues with it.

    • __loam a year ago

      You're taking someone else's labor and profiting off it, without any credit or compensation. To add insult to injury, the person you're scraping pays money to support your traffic. It's a one sided transaction.

      • 9dev a year ago

        You can’t generalise that. Maybe I crawl to provide an annotated preview of their website, to make users of my application more likely to click the link and visit it? There are lots of ways in which crawling benefits everyone, it just requires some mutual respect.

      • deisteve a year ago

        We don't live in the '90s anymore. The bandwidth/CPU cost is moot unless you are spinning up thousands of GPUs to render an HTML page.

        Also, the claim of taking someone else's labor and profiting from it describes exactly what Google does.

        • __loam a year ago

          What Google does is mutually beneficial. Stealing content to reheat and serve in an LLM is very different.

          • deisteve a year ago

            The Google that creates a monopoly on top of the scraped data, and uses third-party content to do so?

            An LLM just places that ability in the hands of the user (a local LLM, strictly speaking).

            • __loam a year ago

              No it doesn't. The end state of a Google search is a human interacting directly with a site.

  • hipadev23 a year ago

    > breaking the law

    citation needed

  • kordlessagain a year ago

    Since when is intent to implement a feature "cheeky"?

    > but shocked to find out there’s a ton of projects openly built around breaking the law

    The original statement oversimplifies a complex legal and ethical landscape in technology. It fails to account for the gradual nature of discovering various projects with potential legal implications, instead projecting an unrealistic sudden shock. This overlooks the nuanced reality of how technology often operates in legal gray areas, especially when dealing with emerging fields or novel applications of existing tech.

    The assertion of widespread illegality ignores crucial legal concepts like fair use, which provides lawful ways to utilize publicly available information under certain circumstances. For instance, web crawling for legitimate purposes, including research or analysis that falls under fair use, can be perfectly legal despite potential objections from website owners.

    Furthermore, the statement disregards the principle that information openly published on the internet, without robust privacy protections, may often be legally utilized in ways the publisher didn't anticipate. This reflects a misunderstanding of how modern information ecosystems function and the legal frameworks governing them. By presenting a black-and-white view of legality in tech projects, the original statement hinders a more sophisticated understanding of the intricate balance between innovation, law, and ethical considerations in the digital age. It's crucial to approach these issues with a nuanced perspective that acknowledges the complexities of applying traditional legal concepts to rapidly evolving technologies and practices.

  • clemlesneOP a year ago

    That's indeed on the roadmap, as you mentioned.

    My primary objective is to build an LLM chat tool based on open-source documentation. The project owner (even more so if it is OSS) is, I think, not responsible for that; the one using it is.

    You are welcome to push a PR to add other backends (including OSS)!

  • CalRobert a year ago

    What law are you thinking of?

  • clemlesneOP a year ago

    And thank you for the compliment! It's great to see that your efforts are seen and appreciated :)

  • bearjaws a year ago

    At this point, honoring robots.txt will ensure you have a terrible search experience...

  • lyime a year ago

    Ignoring robots.txt is "not breaking the law" lol

  • gmerc a year ago

    Eric Schmidt has you covered. You do it to win and the law isn’t for tech bros, it’s for suckers who can’t pay a lawyer
