Deep-Fried Data

560 points by udba 9 years ago · 148 comments

Reader

> For the generation growing up now, the Internet is their window on the world. They take it for granted. It’s only us, who have seen it take shape, and are aware of all the ways it could have been different, who understand that it's fragile, contingent. The coming years will decide to what extent the Internet be a medium for consumption, to what extent it will lift people up, and to what extent it will become a tool of social control.

I agree completely. This is something we should be cognizant of.

pimlottc 9 years ago

> Many [programmers] work jobs that are intellectually stimulating, but ultimately leave nothing behind. There is a large population of technical people who would enjoy contributing to something lasting.

This hits pretty close to home.

keyle 9 years ago

Same here. I'm battling with this thought a lot. Beyond jobs, I think there should be communities of developers, designers, producers, writers, getting together and figuring out this stuff. And I don't mean open source projects. Let's group together smart people wanting to make a difference and have a hit list of things we (people) actually need. A group that would organise people into mission driven development.
I'm so fed up of getting paid to potentially make founders rich. Or to be a small cog in a gigantic machine on a slow decline. I'm also unemployable because I can't buy into the corporate BS anymore. And where I am, there doesn't seem to be design/dev jobs that actually want to make a difference. It's an economy problem.
The startup thing seems to be the best way we go about solving problems in the world today. But if you happen to _not_ be at the right place at the right time, meeting the right people, poof, it's gone. I can't imagine that an advanced specie would operate this way. We should be focused on solving problems, instead of being focused on escaping the rat race, to then be able to solve problems.
I'm glad I am not alone seeking purpose. There goes a point where you're technically advanced, you have itches to fix things and all you see is the broken economy of consumerism and "let's give kids video clips and smileys, derp".
- wtracy 9 years ago
  
  > And I don't mean open source projects.
  Then what do you mean? You described exactly what some of the largest, most successful FOSS projects (Firefox, KDE, Gnome, Libre Office, FreeBSD) are already doing.
  > Let's group together smart people wanting to make a difference and have a hit list of things we (people) actually need.
  Well, the FSF maintains a list of "high priority Free Software projects" that need help, but it's strongly colored by the FSF's politics: http://www.fsf.org/campaigns/priority-projects/
  - Natanael_L 9 years ago
    
    Sounds like his thought is more about building infrastructure, and even non-technical solutions (as in rethinking policy, scientific research projects, etc).
    Like a global collaborative project to improve just about everything.
  - keyle 9 years ago
    
    Thanks that FSF list is great actually.
  - coldtea 9 years ago
    
    >You described exactly what some of the largest, most successful FOSS projects (Firefox, KDE, Gnome, Libre Office, FreeBSD) are already doing.
    They "make a difference"? How exactly? At best, I can understand that for Firefox.
    
    sho_hn 9 years ago
    
    > They "make a difference"? How exactly?
    KDE so far:
    * Is used productively by large numbers of users doing cool stuff, from your fellow developers to the scientists controlling the Large Hadron Collider to the VFX artists who made Gravity
    * Helped popularize open source as a development and licensing model, which now comprises much of the industry
    * Helped Linux get attenttion and grow, ditto
    * Wrote technology in use on billions of computers (KHTML -> WebKit, large parts of Qt via tight symbiosis, taglib, ...)
    * Had a strong effect on industry tool choices (e.g. raising up CMake into the de-facto cross-platform C++ build system by adopting and helping improve it with requests and code, or hosting the valgrind bug tracker, or helping make SVN scale after adopting it as replacement for CVS)
    * Has made many hundreds of people more experienced and competent engineers through providing mentorship, and enabled them to apply those skills and that knowledge elsewhere
    Most of these are ongoing.
    Considering how easy it is in this industry to spend your days making things no one needs or wants, or worse, actively screws over people, I'm pretty happy with how I spent my last ~12 years as a KDE developer :)
    
    wtracy 9 years ago
    
    By providing quality software that lets me and thousands of others get useful work done, and not forcing us to accept onerous licensing terms in the process?
    But hey, none of those can cure cancer, so what's the point, right?
    
    coldtea 9 years ago
    
    >By providing quality software that lets me and thousands of others get useful work done, and not forcing us to accept onerous licensing terms in the process?
    That software (desktop Linux/UNIX) comes for free, and yet, only 1% or less (from browser stats of major traffic points) seem to opt to use it as their desktop.
    Is this the kind of difference the parent was describing? Letting a small minority of people avoid "onerous licensing terms" that billions of others are OK and can get "get useful work done" with?
    In the server, of course, it's a whole different story.
    
    sangnoir 9 years ago
    
    > That software (desktop Linux/UNIX) comes for free, and yet, only 1% or less (from browser stats of major traffic points) seem to opt to use it as their desktop.
    >Is this the kind of difference the parent was describing?
    Cancer kills 171.2 per 100,000[1]. So by your metrics, the Linux desktop folks make a bigger difference than curing cancer as 1% > 0.1712%
    1. Cancer mortility. see https://www.cancer.gov/about-cancer/understanding/statistics
    
    coldtea 9 years ago
    
    >Cancer kills 171.2 per 100,000[1]. So by your metrics, the Linux desktop folks make a bigger difference than curing cancer as 1% > 0.1712%
    This is the kind of illogical result stemming from only reasoning half-way.
    First of all, the cure for cancer wouldn't affect only the ones that die but also the ones that don't but do suffer complications from current treatment, from going broke from paying for therapy/losing their job in the process, to severe chemo side-effects. It also hugely affects the families and loved ones of those who currently die of cancer.
    Second, ever considered the kind and magnitude of impact? Saving even 171 persons (those in a single bunch of 100,000) from dying from cancer is, arguably, quite a bigger deal than sparing millions to have to use Windows or OS X or some commercial UNIX.
    
    sangnoir 9 years ago
    
    >>Cancer kills 171.2 per 100,000[1]. So by your metrics, the Linux desktop folks make a bigger difference than curing cancer as 1% > 0.1712%
    > This is the kind of illogical result stemming from only reasoning half-way.
    I'm glad you saw the flaws in your reasoning - the key phrase in that paragraph was "by your metrics".
    > Second, ever considered the kind and magnitude of impact?
    My point exactly!! It's not just about numbers/proportion game where you get to say "1% or less [desktop Linux usage] is not a big difference" since it means the world to those who depend on it, for example those who cannot afford Windows licenses or an Apple computer or who find non-free software unconscionable.
- intended 9 years ago
  
  That used to be what the net and all of the coder based communities were about. Thats why I used to come to slashdot, or HN when I found the net.
  And we've reached that point when the creatures (the firms and technologies built by those people) of that culture are diverging from ideals of the culture.
  But the above comment is inherently empty - any successful system will eventually expand till it reaches a barrier of complexity which cannot be overcome on its own.
  Figuring out what to do is the challenge.
  The things that worked were coders having free time to spend on interesting projects. But I suspect, that we've better understood the value of coder time, and the major firms are now paying the correct amount to keep coders busy.
  The market BS is a good thing for coders in the short and medium term. People who understand finance and strategy are willing to pay what it takes today, to own a chance at being richer tomorrow.
  If you have a neat hobby? communities will help you get better at it. Maybe if its really good, you can convert that into a product/firm and possibly a good exit. If that happens you won't have to worry about it ever, and you'll be that thing which is respected among your peers - a succesful serial entrepreneur. You would have done the hard thing (product creation, team management, finance, successful exit).
  In a group of people who respect ability and excellence, its hard not to think of the guy who did the harder job as meritorious.
  In short: I don't think theres a market solution for a new market normal.
  Its easier to figure out you are cut of a different cloth, recognize the market dynamics for what they are, and make time to build whatever it is you want to build.
  Eventually a lot of other coders are going to come to similar realizations (provided the cultural bubble online isn't too distortive)
- pjc50 9 years ago
  
  Some friends of mine came up with https://www.mysociety.org/ as an answer to this.
  > "We should be focused on solving problems, instead of being focused on escaping the rat race, to then be able to solve problems."
  Sometimes one person's problem is another person's solution, and vice versa; this is where so much of politics comes from. A lot of people are invested in making sure certain problems stay unsolved, or even unacknowledged.
- vijayr 9 years ago
  
  instead of being focused on escaping the rat race
  This resonates. How would it work in practice though? Many people want to contribute, but they are stuck as they have to provide for their family.
  Also, how do we know what to work on? Someone linked to FSF projects - are there other lists, places we can go to, to find actual tasks/projects to work on? Unless there are incentives (not money/fame, but seeing our efforts put to good use/getting feedback etc) people would lose interest. It would be awesome if we can curate such a list - pharma, food/agriculture, mental health etc. And break down this list into smaller, manageable tasks. I guess many people can find 5-10 hours a week to contribute.
  - keyle 9 years ago
    
    Yeah you get what I meant. There is a disconnect with what we should be doing and the current state of having to provide for our families.
    That said, I'm no communist and I'm not advocating that. I just find it sad that we're more rewarding people working on farming other people's holidays photos rather than pushing humanity forward (and no I don't meant going to Mars, although that gave me a kick!).
- odbol_ 9 years ago
  
  I think the First Things First manifesto is a good start: http://firstthingsfirst2014.org
  - keyle 9 years ago
    
    That is great thank you.
leesalminen 9 years ago

Same here. I was just talking with a friend last night about how software developers should have a creed, like how doctors have the Hippocratic oath. We could potentially cause far more harm than doctors nowadays.
- mamp 9 years ago
  
  Interesting, but note that the Hippocratic Oath doesn't work so well with preventable errors in health care the 3rd leading cause of death in the US https://www.washingtonpost.com/news/to-your-health/wp/2016/0...
- bthdonohue 9 years ago
  
  As part of my computer science program in college, we studied the ACM's Software Engineering Code of Ethics and Professional Practice, which in effect is what you describe. http://www.acm.org/about/se-code
  - cmrx64 9 years ago
    
    And as an aside, the ACM is currently collecting comments on revisions to their code of conduct. http://ethics.acm.org/code-of-ethics/code-2018/

Asparagirl 9 years ago

> I’ve saluted the efforts of Archive Team and the Internet Archive, but their activity is like having a museum curator that rides around in a fire truck, looking for burning buildings to pull antiques from. It's heroic, it's admirable, but it’s no way to run a culture.

...but in the meantime, here's an obligatory and shameless plug for donating to the Internet Archive[1] (tax-deductible in the US), or better yet making a recurring monthly donation so they can more accurately forecast revenue for the year, or better still getting your employer to make a nice big donation to this crucial bit of Internet memorybanks.

And as for Archive Team, we're always looking for a few good geeks.[2] Run an instance of the Warrior on spare cloud servers, or help patch and ship code at GitHub.[3]

[1] http://archive.org/donate/

[2] http://archiveteam.org/index.php?title=Main_Page

[3] https://github.com/ArchiveTeam/ArchiveBot

dcposch 9 years ago

The Archive is awesome, but the author's sensationalist description of what they do isn't really accurate.
For the most part, archive.org is not rushing in to save stuff that's about to be deleted.
Instead they are crawling the web 24/7, patiently maintaining a historical record.
Check out http://oldweb.today
It is amazing
- polpo 9 years ago
  
  What the Archive Team saves does get uploaded to the Internet Archive, but they aren't officially part of it. I think Maciej's description of the Archive Team is accurate - they are the archivists of last resort. When a commercial service is about to disappear forever, they're the ones that spring into action and rescue as much data as possible. If companies and the people that comprise them cared enough about their users' data, there would be no need for the Archive Team.
- 0xcde4c3db 9 years ago
  
  > For the most part, archive.org is not rushing in to save stuff that's about to be deleted.
  Parent mentioned both the Internet Archive and Archive Team. You're right about the Internet Archive, but "rushing in to save stuff that's about to be deleted" is a pretty apt description of most of Archive Team's activity.
- larsiusprime 9 years ago
  
  "is not rushing in to save stuff that's about to be deleted" > I took the burning building metaphor as as a somewhat fanciful, but otherwise accurate description of the natural state of the web -- a series of loosely linked html pages that could disappear at any minute (and often do) as soon as the hosting expires, the author stops maintaining them, or the company reorganizes, etc. As a simple exercise, go browse a popular blog from 2008 or so and count the broken links.
codezero 9 years ago

I've got a VPS sitting around doing very little – what's the easiest way to get started?
- Asparagirl 9 years ago
  
  Run a Warrior! Many flavors available: VirtualBox, Dockerfile, AMI (for Amazon EC2), you name it.
  http://archiveteam.org/index.php?title=ArchiveTeam_Warrior
  It would deeply unethical for me to point out that you could also run the Warrior on free server space that your company might not notice, kind of like the karmic inverse of a bitcoin miner. Deeply unethical. So I won't mention it.

chewxy 9 years ago

Regarding Maciej's fears about machine learning -

I've written about this before, and even right now I'm not sure where I stand exactly, except that tweaking the algorithms to compensate for bias is definitely not the right answer: if you look at the mirror and don't like what you see, you don't draw on top of the mirror to accentuate the result! You go on a diet!

I liked the idea of data gardening, but the thought of going-to-communities is daunting. I get tired even thinking about it.

Regarding living beyond walled gardens:

> Publish your texts as text. Let the images be images. Put them behind URLs and then commit to keeping them there. A URL should be a promise.

But people already do that! The question now is to turn to why people do otherwise. I personally do not understand the reason people say, post long blogposts on Facebook, but I do understand for services like Medium.

For example, I'm extremely tempted to write on Medium because it provides the network effects of readers clicking on tags to read next. So the question is how do we democratize that?

big_surprise 9 years ago
```
  So the question is how do we democratize that?
```
Commenter wtracy has already linked to the FSF's list of High Priority Free Software Projects... From there, look into what they have to say about free wifi (and in particular, but not limited to the OrangeMesh package): http://www.fsf.org/campaigns/priority-projects/free-software...
If a convincing case could be made that the benefits to National Security outweigh the costs to the copyright cartels, I'd be willing to bet that young secondary-schoolers would have a blast with a decently designed curriculum that includes a working student-to-student mesh-network as one of its goals.
- chewxy 9 years ago
  
  I actually meant democratizing the network effect that Medium has. The "free marketing" bit.
  I mean, right now one can pretty freely go write up a blog using self-hosted wordpress, octopress, pelican, hugo or whatever. But choosing that over Medium can sometimes mean a lot more work to put in. But if we can democratize the ease-of-use and the good bits of Medium/Facebook/Twitter... Maciej's end statement about "using open standards, write text in text, images in images" would have been achieved.
  The problem is that corporations now create a significantly more compelling version (in most criteria - UX, UI, etc) of the Free and Open versions out there.
  - big_surprise 9 years ago
    
    Right, I got that. I guess I meant to say- start young. Provide a compelling alternative to ad-driven consumerism... Otherwise, it seems that you're just talking about creating another new corporation with an more compelling (even if nonsensical) reason to adopt.

makomk 9 years ago

Have to admit that I didn't expect to see that quirk of LiveJournal culture mentioned in an article on the HN front page, let alone in a speech to the Library of Congress. It just sort of faded away without really influencing the current generation of social networks.

Also, it's funny how the net changes, how unthinkable it is to have a social network that doesn't slice up people's data and use it to advertise to them now compared to how anti-advertising LiveJournal was back then. Not convinced it's a change for the better.

idlewords 9 years ago

I'm the guy who gave this talk. To add to the funny, LiveJournal hired me to rewrite their ad engine in 2007. I did a horrible job at it, but turned my ineptitude into a principled and lucrative ideological stance that I have milked ever since.
Don't be afraid to pivot.
- bo1024 9 years ago
  
  I enjoyed reading it a lot, thank you. I think about these issues a lot and you gave me new things to think about / crystallized some nice perspectives.
- embarcadero 9 years ago
  
  Was this recorded? Link?
  - pauldino 9 years ago
    
    There's a YouTube link at the top of the page. https://www.youtube.com/watch?v=8gcu2GQf7PI&feature=youtu.be...

jaywunder 9 years ago

I don't think I understand what the exact point of this talk was. Maybe the thesis was stated at the end of the talk when he said that he wishes the internet were more like a city rather than a mall. I think the internet can be like a city, and I think a great example of a place where people with conflicting ideas talk together is HN. Sure HN can be an echo chamber at times. But there's quite a few times when people with differing opinions talk about their different opinions.

Also I don't necessarily understand Ceglowski's stance on why we shouldn't use deep learning and should avoid surveillance on the web. I don't take issue with becoming a datapoint in Facebook's web of people because nothing bad has happened or can happen from me giving Facebook my data. When most people speak out about the data that's being collected about Facebook and Google users they say they're "worried about what could happen" but then never list any bad things that they're actually afraid of. The speaker falls to this issue too. Ceglowski says:

>I worry about legitimizing a culture of universal surveillance.

But then never explains what bad could happen from legitimizing that culture. Maybe I'm completely missing the point of the talk? Please explain what I'm missing if I'm actually missing something.

idlewords 9 years ago

The audience for this talk was a bunch of librarians and fellow travelers who are bringing large archives and collections online, often at great expense. I wanted to encourage them to find new, engaged audiences for these collections, rather than fixate on how to analyze them with computers.
With regard to the dangers of surveillance, I've made a sustained argument about this in other talks. It boils down to the data being collected having great power to harm people if it is ever put to malicious use, and a lifespan that exceeds that of institutions we know how to run. My beef is not with the surveillance alone, but with the combination of surveillance and permanent storage.
- jaywunder 9 years ago
  
  Thank you for explaining that! The context is meaningful and makes your talk make sense.
  On the regard of data talking into the wrong hands, I take issue to this argument because it's not a unique problem to personal data collection. Any data could be hacked - bank information, address, whatever. But that doesn't mean we don't use the internet for banking and etc. It means we try to make systems that are difficult to hack. It seems like you'd want data collection not to happen on websites like Facebook and Google, when hacking isn't a unique problem to those websites.
  - idlewords 9 years ago
    
    Here's a capsule summary of what I'm pushing for: http://idlewords.com/six_fixes.htm
    
    iyn 9 years ago
    
    I agree with that, but I have a small suggestion until these points are reality: you can consider adding/enabling SSL/TLS for your blog. Thanks!
    P.S. I really like your posts and your tweets are hilarious, please don't ever stop.
    
    rubidium 9 years ago
    
    Ok, this tweaks my curiosity. Why would one put SSL on a static personal/blog type website (assuming one doesn't care about the google penalty)?
    
    mbrock 9 years ago
    
    Otherwise network operators can sniff, alter, and generally fuck with the integrity of the site and its users.
    
    b2m9 9 years ago
    
    For example, ISPs are not able to crawl your traffic if it's via HTTPS. I've worked on data sets gathered by major ISPs and it's scary how much they know about their users (especially if they also have a mobile phone with the same company). ISPs use such intelligence for personalised marketing (either for their own product catalogue or 3rd parties)
    
    odbol_ 9 years ago
    
    The URL isn't encrypted though, is it? Since there's no dynamic content on the page, they already know exactly what you're looking at.
    
    detaro 9 years ago
    
    The domain isn't, the full URL is. (But content size etc probably still allow identification of an individual page on a small site, and the context of the domain is already valuable)
odbol_ 9 years ago

There's already been studies on how Google can manipulate elections just by reordering search results, or how Facebook can alter your mood by tweaking what goes into your news feed. It's not hard to imagine how bad things could get if that kind of power got into the hands of the next Mussolini or Stalin (which is a very real possibility aka Trump)

marklyon 9 years ago

I provide guidance to attorneys involved in the discovery process; "Technology Assisted Review" is of huge interest to those teams, as it allows them to leverage coding on a small sample of the population across a much larger set of documents. For many cases, the cost and (occasional) time savings is instantly attractive. Sadly, the process is hard to do well. Far too many screw it up in new and amazing ways.

The author's concerns over machine learning are well-founded. The best option I've been able to identify to ameliorate some of the concerns is focusing on the population that will be suppressed. Once the model returns the desired recall / precision, drawing samples from the excluded population with a rigorous acceptance standard can help validate whether you've simply built a model around your biases. Couple that with allowing an opponent to validate a randomly-selected sample and you've cleared up a lot of the uncertainty in the model.

It's not perfection, but perfection is a very difficult standard.

abofh 9 years ago

The issue with that approach is ensuring the suppressed are represented. When it's black vs white, you can oversample one and be done.
However, if there's any winner take all built into the system, there's a strong incentive to not even acknowledging dissent.

pcmaffey 9 years ago

Machine learning does not have less bias than human researchers. It is simply magnified at scale.

And that scale is exactly the state of the internet. There is so much data available to study and understand, that we absolutely need better tools, like machine learning or whatever we want to call it, to help us keep up. Shit's moving faster than our human perception can handle, especially for those who didn't grow up with the internet.

Yes the data analyctic tools we have right now are premature— like fast food to our productized minds— but they will improve rapidly, as our taste for quality improves.

But sure demonizing the things you don't like is one step on the path to learning what's truly valuable.

ohthehugemanate 9 years ago

Bias is a pattern-generation process. Machine learning is a pattern-recognition process. Any bias on the part of the (human) data collection, or the (human) training program author, gets spit out as a "pattern", because it is one. The problem is that it gives the illusion of a bias in reality.
My go-to example is machine learning police enforcement direction, often used as a counter to racially biased policing. This works in any city with a historical problem of racial bias in policework. We give the algorithm all the data we have from the last 60 years of policing this city. Patrol schedules, incident records, arrest records... everything. The computer magically tells us where we should focus our efforts. To the police chief who paid for the system, and especially to the media reporting on it, it looks like a computer is making the decisions without bias. Hooray!
Of course, anyone who's ever worked with machine learning can spot the problem. The data set was generated by racially biased policing. That bias will be reflected in all the records: more arrests for race X, more patrols scheduled through their neighborhoods, more incident reports from those areas. So when the algorithm says "increase patrols in this neighborhood," or "look for people who fit this profile," it is simply synthesizing the patterns from 60 years of racial bias. So the police in LA have a real problem: their "unbiased" computer program is telling them that their criminals look like black people, and they should increase patrols in Compton. So they do, and that data only takes the data further from "un-biased" reality. In fact, the police "black box" is only pointing out a history of racially biased policing. We're relabeling it as recommendations for future behavior.
- yummyfajitas 9 years ago
  
  Of course, in reality this bias can be corrected for. I don't know if specific crime-stat software does it, but it's certainly doable. Here's an example where I solve literally the same problem (better measurement => more events detected) in a different scenario. Is it remotely controversial that I can do this for a sensor network?
  https://www.chrisstucchio.com/blog/2016/delayed_reactions.ht...
  You might also be interested to know that a variety of studies have shown that policing is not particularly biased. Arrest statistics and the like correspond pretty well with NCVS and similar crime victim surveys.
  http://slatestarcodex.com/2014/11/25/race-and-justice-much-m...
coldtea 9 years ago

>Machine learning does not have less bias than human researchers. It is simply magnified at scale.
Scale differences can and often do lead to qualitative differences.
Individual (or aggregate) human researchers are not hooked up in huge services to make inferences and deductions automatically about billions of people.
Besides those machine learning tools, beside the huge data sets, are programmed in their general framework by human researchers, and are given weights, constraints, and fine-tuning by them, so they have both kinds of biases.
>But sure demonizing the things you don't like is one step on the path to learning what's truly valuable.
So, kind of like disparaging via a straw-man a speech that offers detailed argumentation?
- AndrewKemendo 9 years ago
  
  Individual (or aggregate) human researchers are not hooked up in huge services to make inferences and deductions automatically about billions of people
  Yes they (we) are. It's the same data set. TV, movies, papers, internet videos et al. is all the same biased, labeled data that is being fed (watched, listened to etc...) to machines. You automatically make inferences and deduce things about people based on labeling and training of your brain. You're constantly fine tuning by getting new weights about things through interactions with others and media.
  - coldtea 9 years ago
    
    >Yes they (we) are.
    I didn't say researchers and/or individual people are not making such judgements about billions of others (e.g. "the Chinese suck/are great").
    I said they are not "hooked up in huge services" to make them automatically for billions of individuals -- like an ML algorithm used by Google or Amazon or some government agency etc would do.
    
    AndrewKemendo 9 years ago
    
    My point is that it's the same thing. Individuals with outsized power and influence can affect billions of people based on their own judgments and implicitly make those judgments automatically for billions.
    That's what this question is all about - should machine systems be responsible for the kind of sweeping decisions that humans are making on those populations now? Probably.
yummyfajitas 9 years ago

This is simply not true. Most algorithms can and will correct for biases in their inputs.
See this (somewhat technical) article where I go into explicit (simulations in numpy) levels of detail:
https://www.chrisstucchio.com/blog/2016/alien_intelligences_...
The best analogy I've come up with for the non-technical is that algorithms are like humans trying to draw inferences on octopus society. Some octopi might have bias against some other octopi, but it's the height of octopusthromorphism to to expect a human to reproduce that bias.
- wrsh07 9 years ago
  
  This is very optimistic. There are well known and documented cases of ml algorithm bias and its cause [1].
  And it's not surprising that data itself contains some biases from the humans creating it. Suppose police are asking machine learning where more crime is committed - there will be a feedback loop. Where are they currently making more arrests? If they spend more time there, the bias will be exaggerated.
  The op correctly gauges how we should be cautious. Your post, I'm afraid, is misleading at best.
  [1] https://www.google.com/amp/s/www.technologyreview.com/s/6017...
  - yummyfajitas 9 years ago
    
    Of course data contains biases. But again, please read the article I linked; algorithms will have a tendency to correct that bias.
    The examples in the article you link to are not algorithmic bias at all. They consist of:
    1) Humans at Facebook manipulating trending results.
    2) Google's keyword algorithm (accurately) reflecting the fact that people with black names are more likely to have arrest records.
    Lets distinguish "bias" from "accurately learning things you wish it wouldn't learn" or "accurately learning things you wish weren't true."
    None of what I'm saying is remotely controversial. If I told you statistics could detect and correct bias in a mobile phone compass, you'd just think "cool stats bro". Is this article remotely controversial? https://www.chrisstucchio.com/blog/2016/bayesian_calibration...
    The specific feedback loop you describe - variable detection probability => variable # of detections - can be directly mitigated. For a non-controversial example drawn from sensor networks (sensors report events with a delayed eraction, the longer you wait the more events you detect), see here: https://www.chrisstucchio.com/blog/2016/delayed_reactions.ht...
    (You can find similar examples all over the place. I just link to the ones I wrote because they spring immediately to mind.)
    In a compass, a sensor network, adtech or other quant finance, the idea that machine learning can fix biased inputs is not remotely controversial. The concept that statistics suddenly stops working to fix racism is just silly anthropomorphism.
    
    wrsh07 9 years ago
    
    Aha - I think I see our miscommunication. When you say bias you mean statistical bias.
    Yes, machine learning is able to correct for that kind of bias - 538's polls forecast is a good example of that.
    But you don't get to redefine racial bias to be something innocuous. Yes, black names are more likely to have arrest records, but that "fact" is super misleading [1].
    Finally, you're talking past me. I'm not saying that statistics is broken. I'm saying that we should be especially mindful of the OPs point when they say this:
    > So what’s your data being fried in? These algorithms train on large collections that you know nothing about. Sites like Google operate on a scale hundreds of times bigger than anything in the humanities. Any irregularities in that training data end up infused into in the classifier.
    I think the OP author also has a related post about the kind of bias I'm talking about: http://idlewords.com/talks/sase_panel.htm
    [1]: http://www.huffingtonpost.com/kim-farbota/black-crime-rates-...
    
    yummyfajitas 9 years ago
    
    Without getting into a dispute about the definition of "bias", I'm saying that algorithms can accurately measure reality even if input(x=white, all else equal) != input(x=black, all else equal).
    You are saying that algorithms are accurately measuring a reality you wish were different. I don't disagree with this.
    The right thing to do is to actually answer unpleasant moral questions like "if blacks are 4x more likely to be dangerous criminals, what should we do about it?" But I guess overloading the word "bias" is a nice substitute for clearly thinking things through.
    
    eridius 9 years ago
    
    The problem is you're modeling a biased reality. And accurately modeling a biased reality may in many cases accentuate the bias. Take for example the previously-mentioned case of using an algorithm to determine where to focus your policing efforts. If the data you have says that more arrests are done in a particular part of the city, then you'll want to put more police there, right? But areas where there are more police will tend to see more arrests. So the fact that you're putting more police in an area where you see more arrests is just going to make the bias more extreme, causing even more arrests there. This causes a feedback loop. So you may be accurately modeling reality, but you're modeling a pre-existing bias and making it worse. And who knows why that pre-existing bias was even there? The fact that there were more arrests there may not be because that area actually has more crime committed, it could be due to other factors, such as racial profiling by police, and in that case your algorithm is now accidentally racist because it's perpetuating racial profiling.
    
    dragonwriter 9 years ago
    
    The problems are really twofold:
    (1) Defining the proper goals, and
    (2) Measuring the right things (such as the real goals of interest rather than biased proxies.)
    With police deployments, you are assuming the solution (rather than letting your algorithm optimize it) by saying "I want to put more police where more arrests occur". What you really want is probably something more like (the exact goal may be different, of course) "I want to deploy police resources where it will most effectively reduce the incidence of crime, weighted by some assigned measure of severity." Then let your ML algorithm crunch the various measurable factors and produce an optimum deployment to do that.
    (But, then again with that goal -- and similar problems exist with many likely real goals -- you run into the other problem, which is measuring the incidence of crime -- measuring crime reports may be the obvious approach, but there's plenty of evidence that lots of factors can bias crime reports, including communities having bad experience with police being less likely to report crimes.)
    
    wrsh07 9 years ago
    
    Thank you. This is so much clearer than what I was saying.
    As you say, proper goals and measurement can fix a lot of these problems, and I don't think it's obvious that ml algorithms solve either of those
    
    yummyfajitas 9 years ago
    
    I directly addressed this critique two posts up. Why don't you go read that post?
    https://news.ycombinator.com/item?id=12627359
    
    eridius 9 years ago
    
    I did read it, but you're talking about correcting for measurement biases in order to recover an accurate view of reality. But what I'm saying is that accurately measuring reality may in fact be how you get bias, because the very thing you're measuring may be biased. If you're aware the bias exists and have tools that can measure the bias itself then maybe you can correct for the bias, but you can't just expect your algorithm to automatically correct itself in the presence of bias because its goal is to model reality, not to figure out whether there's inherent bias in the thing it's modeling.
    
    yummyfajitas 9 years ago
    
    Here's my concrete claim. Let pp = police presence, then P(crime detected) = r(pp).
    Measured crime = crimes x r(police presence).
    As long as your model is expressive enough to capture r(pp), bias should be detected.
    Fundamentally you are making the claim that there are certain types of variable correlations that are just so evil that no statistical model can possibly understand them. That's a very bold claim; it's essentially the claim that science doesn't work.
    
    eridius 9 years ago
    
    No, I'm claiming that P(crime detected) != r(pp). More police in an area typically means more crime is detected, but that's not the only factor. If you have two areas with identical police presences and identical actual crime rates (as opposed to reported crime rates), the rate of crime detection (as measured by arrests and whatnot) may be higher in one area due to other factors such as racial bias (not just racial profiling, but also things like police letting white people off with a warning where the equivalent black person would be arrested). So you cannot simply correct for this by accounting for the police presence.
    What's more, your data may not even have the necessary info to figure out if there's a bias. For example, what if police are more likely to arrest someone wearing a red shirt than someone wearing any other color shirt? Unless the color of the person's shirt is part of the arrest report, there's no way your statistical model is going to figure out that red shirts affect arrest rate.
    
    yummyfajitas 9 years ago
    
    Your function r = r(pp, red shirts, race of offender, etc) exists. A model of the form a x r + b x something_else + ... will detect the bias you've described, assuming of course the biasing variable is either present or redundantly encoded in the data set.
    We've now established the existence of a statistical model which can detect this bias.
    Now, any other model which is capable of expressing your specific r(p) can do the same thing. The entire purpose of fancy models like random forests is that they can express lots of functions while also being reasonably generalizable.
    If you want to claim that this bias is much more difficult to encode in an SVM than all the other typical hidden patterns, you need to establish that your specific r(...) is somehow vastly more complicated than all the other things that machine learning models regularly detect. That's a pretty strong claim.
    Interestingly, you are now arguing the exact opposite of what most "machine learning is racist" people claim. They typically claim machine learning is racist because algorithms actually learn hidden factors they wish it wouldn't; e.g., a lending algorithm might "redline" blacks who don't pay back their debts. I take it you believe this is highly unlikely, and algorithms can't possibly distinguish between men and women and then show high paying job ads to more men than women?
    
    srean 9 years ago
    
    >Your function r = r(pp, red shirts, race of offender, etc) exists. A model of the form a x r + b x something_else + ... will detect the bias you've described, assuming of course the biasing variable is either present or redundantly encoded in the data set.
    No no no. Had to respond to this because this such a common confusion (not to say that you personally have this).
    That such a model exists within the class of models being used says absolutely nothing about whether the statistical/ML algorithm will find it with any degree of confidence from a sample. The science is still grappling with the question of how to do model selection. There are two, sort of, equivalent class of methods, regularization (this can be a regularization over the dependency structure too, not just a simple penalty) and prior. Its only when you get those right that you have decent chance of estimating well, from reasonable amount of data.
    Short answer: universal approximation property of a class of models says nothing about learnability.
    
    eridius 9 years ago
    
    Regarding your last paragraph, there's two different angles here. The "machine learning is racist" angle I think is quite valid, but covers a different topic than what we've been discussing here. To be more specific, there's two different ways in which we can have racist models:
    1. The algorithm is biased in a way that reflects reality but does not reflect how we wish it to behave. This is the "machine learning is racist" angle. A lending algorithm might quite rightly think that black people are a higher risk, but this is ethically problematic to act on, because denying loans to black people only serves to compound the social problem (even though it may make financial sense for your bank).
    2. What I'm arguing is that we can have racist algorithms due to the fact that the data itself may be biased in a way you're not aware of. To take the red shirt example, something I forgot to say before was that if, say, a fad spreads among the black community of wearing red shirts, then you're going to see an uptick in arrests of black people, but your algorithm won't be able to figure out that this is actually due to arresting red-shirted people, which means it will believe that black people in general are more likely to be arrested.
    
    yummyfajitas 9 years ago
    
    Here's my point.
    (1) is only possible if your data provides access to the biasing variable, perhaps via redundant encoding. This is the standard critique folks make.
    As per (1), the biasing variable is available. Now if the algorithm is expressive enough to describe the functional form of the bias (e.g. the bias is quadratic, and the model includes quadratic terms), it will fix that bias.
    You're right that there are lots of hidden variables that we can't use in a predictor. Murderous intent and mafia membership are also not available as predictive factors. You could build a more accurate model if you had that data. So what?
    
    eridius 9 years ago
    
    The problem with (2) isn't just that your model isn't as precise as it could be, it's that your model may be inadvertently biased because all of the data that it was fed was biased. This comment (https://news.ycombinator.com/item?id=12625917) gives a good example of that one. No amount of expressivity in the algorithm will account for the fact that the Friendface model (read the comment) was trained on a predominately white userbase versus FaceSpace's model which is trained on a predominately urban black userbase.
    
    srean 9 years ago
    
    Are you saying that it can form a good estimate of the conditional probability ? I can believe that if the sampling process preserves the conditional.
    Otherwise one would have to make assumptions about (or in other words, model) the corruption process. The bias compensation machinery then has to be deliberate, wont happen on its own.
    Some sampling processes do not modify the conditional. In those cases no special machinery would be required.
    
    yummyfajitas 9 years ago
    
    tOne approach is to directly model the corruption process. Being the model-based-Bayesian guy I am, this is something I like to do.
    But if your model is sufficiently expressive you don't need to explicitly build or model the corruption process. In the example in my linked blog post, test scores might be biased against blacks. But race is also redundantly encoded, so the algorithm has enough information to fix the bias completely by accident.
    Fundamentally what I'm saying here is that bias is a statistics problem and has a statistics solution. Insofar as your complaint is algorithms finding the wrong answer, the solution is better stats.
    And nothing whatsoever that I've said here would be remotely controversial if the topic were remote sensing.
    
    srean 9 years ago
    
    > But if your model is sufficiently expressive you don't need to explicitly build or model the corruption process
    This is the claim that I am having trouble with.
    Say I have two random variable X,Y with some joint distribution. If a corruption process can mess with the samples drawn from it, I cannot see how it could possibly recover either the joint or the conditional.
    Are you saying that the corruption is benign like missing at random or missing completely at random ? Then its much more believable.
    
    yummyfajitas 9 years ago
    
    So we both agree that if the bias is linear, and your model is linear, you capture it. Similarly if the model involves interaction (score x is_black), and you include linear interaction terms, you'll also capture it.
    Now the question arises; what if things are more complex?
    In real life they always are; both your biasing factor and the rest of the model. So we've cooked up all sorts of fun models like SVMs, random forests and neural networks to analyze such complicated models and find hidden features and relations that we didn't think of. Bias is one such feature.
    If I built an algorithm that learned to display different ads to mobile and desktop people (i.e., treat mobile "time on site" differently from desktop "time on site"), would you be surprised by this?
    
    srean 9 years ago
    
    That makes it clearer. I got thrown off by the claim that a standard algorithm will be able to de-bias if no de-biasing machinery has been built into it. BTW the machinery may be implicit in the choice of the model.
    Simple toy example: say Y is a threshold function of X + high variance noise. I draw samples from this and scale down all y_i's that exceed the (unknown) threshold. In other words my corruption process is dependent on X. We can make it depend on Y too. These would require explicit modeling. Just throwing a uniformly rich class of P(X,Y) wont by itself fix this. We have to carve that space of P(X,Y) with the knowledge of possible corruption process to get a good model of the behavior before the corruption is applied.
    BTW we have gone way off tangent, but that was a good conversation.
- mrow84 9 years ago
  
  To correct biased measurements (in a careful way) you need
  1. Enough knowledge about the structure of the bias to be able to devise a model for it.
  2. Some measurements from which to fit the model, with errors that are uncorrelated with the errors in your original data.
  These things are not always easy to obtain, even in relatively mundane settings. It is also a distinctly non-automatic procedure - it requires someone to decide that a bias exists, to model it, obtain the relevant data, and fit the bias correction model, all before they can begin to obtain unbiased (or probably just less-biased) measurements.
  - yummyfajitas 9 years ago
    
    I'm not making the claim that an algorithm magically fixes everything. I'm claiming that sometimes they do which makes bias less likely to be present in the ML model.
    You don't need a human data scientist to decide bias exists, model it and fix it at all. If you read the post I linked to, you can observe a synthetic example of linear regression (with redundant encodings) accidentally fixing bias.
    So yes, if your model is expressive enough and you have sufficient data, it will automatically fix bias. Is it really shocking that an algorithm which is good at finding hidden patterns will find a hidden pattern?
    
    mrow84 9 years ago
    
    I don't really understand that claim. You are explicitly adding a bias that is linearly dependant on your race variable, and then allowing your regression to recover that bias by introducing noisy measurements of race (which you as the modeller knew was the thing causing the bias). As you say, that is unsurprising.
    That result does not, however, address my point, which is that if the structure of the bias is difficult to understand, or perhaps even just difficult to model, and if relevant measurements (with errors that are uncorrelated with your original errors) are unavailable, then bias correction is essentially impossible.
    
    yummyfajitas 9 years ago
    
    The point is that the bias is linear, and my model is linear, so the model fixes things. The example is synthetic (so we could know what the right answer is and check if we recover it) so of course I put everything in.
    In the linked article, I explicitly reference a real world case where the same linear model was used to discover that grades and test scores are biased in favor of blacks: http://ftp.iza.org/dp8733.pdf
    In more complicated situations, the bias would need to be amenable to detection by a neural network, an SVM or random forest. The entire purpose of models like this is that lots of hidden patterns are detected.
    Even if relevant measurements are unavailable, one can use redundant encoding to fix bias. Delip Rao explains redundant encoding here, for example, though he is more concerned that ML models might learn facts he wants to remain hidden: http://deliprao.com/archives/129
    
    mrow84 9 years ago
    
    To remain with the example in your blog post, your model fixed things because the implicit bias model was correct (linear dependance on race), and the data were available, either directly (via the race variable) in the "What if measurements are biased?" section, or indirectly (via the noisy redundantly-encoded race variables) in the "What if we scrub race, but redundantly encode it?" section.
    In the first of those two sections you yourself note how bias correction is not possible without the relevant data: "If we scrubbed the data this result would be impossible. Running least squares on scrubbed data yields alpha = [ 0.29878373, 0.30869833] - we can't correct for bias because we don't know the variable being biased on."
    I'm not disputing that bias correction is possible, only that it can be much harder than you seem to be implying, with statements like "Most algorithms can and will correct for biases in their inputs.", and "Of course data contains biases. But again, please read the article I linked; algorithms will have a tendency to correct that bias."
    I have some experience with bias correction in (ocean) weather forecasting, and in that domain there were problems both with the difficulty of modelling the bias structure, and with obtaining measurements reliable enough for bias correction.
    
    yummyfajitas 9 years ago
    
    What I'm disputing is this:
    Machine learning does not have less bias than human researchers. It is simply magnified at scale.
    This is fundamentally wrong. Given data on the biasing factor, most algorithms will try to use it and improve things. Sometimes information is unavailable. On net there is a reason why many algorithms will reduce bias, and no particular reason why they would increase it equally in the remaining cases.
    
    dragonwriter 9 years ago
    
    > Given data on the biasing factor, most algorithms will try to use it and improve things.
    Unless the bias is in what they are designed to optimize for (either because the goal is explicitly biased or because the operationalization of the goal into a concrete measure is, whether intentionally or not, biased), in which case they will obviously reinforce it.
Singletoned 9 years ago

> Machine learning does not have less bias than human researchers.
You are right that machine learning gains bias from the humans that created it, but unless they managed to transfer 100% of their biases to it, it will always have less bias.
- chewxy 9 years ago
  
  We impart our bias on ML algorithms by choosing what data to use to train the AI on.
  The problem, I think is one of self-selection.
  Consider two hypothetical social networking websites - Friendface and FaceSpace. Friendface's userbase are mostly white users, while FaceSpace catered mostly to urban, black populations. And it would make sense too - you would only join a social network if your friends are on it. If you're white, chances are the majority of your friends are also white. And vice versa.
  So Friendface is a lot more active on their ML front. The problem is when Friendface releases their data - because they're more active on the ML front, and ML scientists love to not have to collect their data, what happens is more and more models are trained on the Friendface data and more and more models are being optimized based on Friendface data. Apparent "structural" racism happens. Tumblrinas all pounce on it as if it were the biggest oppressive struggles of their lives.
  A very cute thing to imagine in this scenario would be to imagine FaceSpace suddenly got good at NLP, and open sources their statistical language model. Recall that FaceSpace users are more likely to use AAVE in their communication, so what do you think the statistical language model would be?
  In the original article, Maciej mentions "going to the community" - using crowd wisdom to handle these sorts of thing, and preferring to use open standards as opposed to silo'd standards (like writing your blog post on facebook... why??!!). While that sounds like a good idea, like I've mentioned in my other comment, it also sounds tiring as hell.
  Firms act rationally (more or less)... ML is driven by huge companies with huge datasets. Why would they need to prune external datasets when they could just do their ML research with a few SQL queries?

1024core 9 years ago

"The names keep changing—it used to be unsupervised learning, now it’s called big data or deep learning or AI"

Um, I'm sorry, but unsupervised learning and deep learning are not the same.

omginternets 9 years ago

The point is that these phrases become buzzwords, at which point deep-learning vs AI becomes a distinction without a difference. In the mainstream media you can safely replace both of these terms with "statistics" and not alter the meaning of the sentence.
In other words, terminology can be used to make precise, meaningful distinctions, or it can be used to embellish.
roel_v 9 years ago

Technically sure, but outside of the world of statistics, nobody makes that distinction or cares.
idlewords 9 years ago

What's the distinction?
- absherwin 9 years ago
  
  Unsupervised refers to whether or not the dataset is being trained against anything. Think about the difference between: How many people will view this webpage? Divide these pages into 20 clusters? The first is supervised. The second isn't.
  Deep learning refers to a particular type of a particular learning technique: Specifically a neural network that has many hidden (intermediate) layers. Deep learning can be used for either supervised or unsupervised learning.
  - idlewords 9 years ago
    
    Thanks for the explanation! I was remembering what flavors of snake oil were being peddled to people in the humanities in the early 2000's, and didn't mean to muddy the waters around technical terms of art.
  - gabrielgoh 9 years ago
    
    I agree with your sentiment, it feels out of place because deep learning, AI and big data are buzzwords, but unsupervised learning is a rather technical term in machine learning referring to a very specific class of problems.
    
    dkersten 9 years ago
    
    Are they really buzzwords? To me, they have rather particular meanings (although I guess others may feel differently):
    Deep learning: a particular type of artificial neural network with many hidden layers (and the associated tech to make this work/trainable)
    AI: The field of computer science which aims to make computers smarter. Like most fields, there is much overlap with others, for example, statistics.
    Big Data: A buzzword. About the best definition I can find is anything which has the 3 V's: Volume, Velocity & Variety. In general, outside its use as a buzzword, I think big data is generally thought of as "when you need a distributed system to process your data", be it because of volume, velocity or variety.
    Supervised and unsupervised learning: whether or not you require example data for training
    Machine learning: some people say its the subset of AI that deals with statistical methods, other people say its just another word for AI.
    
    apathy 9 years ago
    
    Specific only in that the categories aren't supervised.
    Furthermore, suppose you have labels for some but not all points on your data (i.e. your model is designed to be robust in the face of things it hasn't been trained for). There are a nontrivial number of people who work on either side of the "semi-supervised" divide, e.g. clustering with examplars or pulling out the generative model for a discriminative task. Personally I like these better, as they're more akin to what people seem to actually do (encounter new things and try to make sense of them).
    Anyways. If you look at the delta in performance between "old" techniques like random forests or gradient boosting vs. deep convolutional networks, it tends to be quite small until your datasets grow to very large sizes. For things like images that's not much of a problem. For things like rare diseases it's a huge problem.
    
    moultano 9 years ago
    
    Deep learning isn't a buzzword, it has a fairly precise definition. It describes a particular class of algorithms that happens to be the state of the art for many problems.
yarou 9 years ago

Yeah, but garbage in is still garbage out.
Which is the point he was trying to make.
- aub3bhat 9 years ago
  
  It's not "garbage" it called science & mathematics, those terms have "meaning", and have lead to progress in hard long-standing problems, which have in turn lead to billions of dollars and millions of man-hours allocated to understanding and using them.
  Just because you lack ability to understand nuances of something does not makes it "garbage".
  - lliamander 9 years ago
    
    Respectfully, I believe the "garbage" to which yarou was referring to was not the algorithms, but simply to the data that is being fed into these algorithms.
    The point being that, no matter how sophisticated these techniques are, the quality of the results is constrained by the quality of the input data.
    As Charles Babbage said: "On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."

dvdplm 9 years ago

Idelwords is one of those blogs worth dropping everything you do and just take a deep breath and dive in and revel in the joy of clear thought expressed through clear prose. Love it. Thanks.

sdenton4 9 years ago

"...Dim witted grad student that you can't really trust..."

Reminds me of the phrase "graduate student descent" for training neural networks...

I've been noticing more casual dismissiveness towards grad students lately. They are certainly often treated as the grunt laborers of academia, in areas where career prospects are downright stupid. I generally feel it would be more productive to at least pretend that they're being trained to be independent, aggressive researchers in their own right, though.

gabrielgoh 9 years ago

grad students are put in the same category as interns and teenagers, a naive type of person still in the making. i dont think there's any ill will intended.
- eanzenberg 9 years ago
  
  You are dim witted.
  No ill will intended.
  - elmigranto 9 years ago
    
    There's a difference between metaphors/jokes and insults addressed at you personally.
  - edmccard 9 years ago
    
    >No ill will intended.
    Is that a new short version of Hanlon's Razor?
vilhelm_s 9 years ago

Yeah, I also thought this metaphor seemed weird. "Now you need some adult supervision in the room", but grad students are, in fact, adults. And not particularly dim-witted, as a rule.
nullc 9 years ago

Considering the working conditions and prospects for the future that graduate students face, one /could/ argue that that a selection bias should be expected there.
> to at least pretend that they're being trained to be independent, aggressive researchers
But that is the issue, isn't it-- it would be pretending.
munin 9 years ago

looks like we found the grad student!
but seriously, as a grad student, absolutely no one gives us respect. not our peers, our bosses, or society. why would you expect some random on the internet to do better?

Esau 9 years ago

"And this time it's not the government, but the commercial Internet that has worked so hard to dismantle privacy."

So true.

udbaOP 9 years ago

I'm currently applying for co-op jobs (internships) and while trawling the university job board I've seen many positions requiring big data this or machine learning that.

What's not clear to me is why companies who don't seem to have any need for machine learning team (i.e. a subscription box company) are looking to hire one.

Surely part of this can be pinned down to the hype associated with ML that may well die out, but the proliferation of these tools doesn't bode well for Maciej's dream of a weird, creative, and interesting internet.

teej 9 years ago

These companies aren't looking for someone to develop new machine learning techniques, they are just looking for someone who can slap together existing utilities to meet their goals.
Companies that run on subscription literally live and die by their churn rate. It is both feasible and reasonable for a subscription box company to hire someone to use machine learning to build a predictive churn model. That may seem trivial to you but that's the reality behind those job posts.

morecoffee 9 years ago

Giving up control is harder than the author makes it seem. It isn't so much that you give it up, but that you give _to_ someone else. Picking that someone else is extremely difficult and a wrong choice will destroy your community.

Using machine learning on the other hand is a safe bet. It is much easier, I would assert, to write machine learning code to organize data than to curate a community of humans to organize data. The ML approach will do pretty good even if it isn't the best, which is why it's what everyone is switching to.

Keeping with the author's example, is it easier to organize erotic fanfic with a computer, or enable a community to do it without spiraling out of control?

crzwdjk 9 years ago

For example, it is clearly easier to write a machine learning system to find interesting articles and highlight insightful commentary, compared to using something so crude as a group of people on a website collectively voting on which stories and comments they like... wait a minute...

skybrian 9 years ago

If you think the Internet is as safe and controlled as a shopping mall, you probably should be reading Krebs on Security more.

People tend to move towards the more mall-like areas of the Internet due to spam and abuse that they don't want to deal with. This can be low-level stuff, or (as in the cases of Kreb himeself) sometimes the attackers get out the big guns, and you need to run for cover.

And that's why we're hanging out here, after all, and not in some unmoderated forum. And even here, post on certain subjects and conversation quickly degenerates.

I think we do need a wider variety of spaces to hang out, though. No set of rules works for everyone. And if you do want 4chan, you know where to find it.

aab0 9 years ago

> If you think the Internet is as safe and controlled as a shopping mall, you probably should be reading Krebs on Security more.
That's an amusing comparison, given how much of Krebs focuses on offline ATM skimming, copying credit cards at point-of-sale terminals, hacking major retailers's CC databases, and using stolen cards at retail and mall stores to cash them out...

erichocean 9 years ago

> Publish your texts as text. Let the images be images. Put them behind URLs and then commit to keeping them there.

I sounds like he's saying ephemeral content is worthless and should be shunned.

I, and hundreds of millions of others, disagree. You want a bland, awful, boring society? Easy: make everything you do stick around forever—like a promise. And then watch the world self-police as the lifeblood drains out of it.

You'll get…Facebook. No thanks.

idlewords 9 years ago

The audience for this talk was people with very large collections they're bringing online. I was trying to encourage them to avoid exotic formats, custom plugins, custom software (shudder) when they put this material online, and make them web accessible.
For example, here is three quarters of a PETABYTE of historical American newspapers: http://chroniclingamerica.loc.gov
- erichocean 9 years ago
  
  Okay, that makes sense. I 100% agree that bringing collections online in exotic formats is a terrible idea.

goldmar 9 years ago

Thank you for this article. I have really enjoyed your writing style, especially the creative metaphors you're been using :)

TranceMan 9 years ago

I have been having similar thoughts for a while: https://news.ycombinator.com/item?id=10937201

qwertyuiop924 9 years ago

>the Internet is a shopping mall. There are two big anchor stores, Facebook and Google, at either end. There’s an Apple store in the middle, along with a Sharper Image where they are trying to sell us the Internet of Things. A couple of punk kids hang out in the food court, but they don't really make trouble. This mall is well-policed and has security cameras everywhere. And you guys are the bookmobile in the parking lot, put there to try to make it classy.

It's already been mentioned, but this guy needs to get out a bit more.

The internet is a city. There's the specialist shops (HN), the bustling malls (Reddit, YT), the shady back alleys (4chan, 8chan etc.), the historical districts (Usenet, Archive.org), the cafes (IRC, ICQ, Slack, etc.). To their credit, the author is more knowledgeable than most, however.

I see so many dismiss the internet as just Facebook, or YouTube, discuss trolling as if it's a single phenomenon, and it's a recent thing, associated with Social Media. So many think that there's an internet culture: there isn't: there's a set of almost infinite numbers of overlapping, interlinked cultures. I can even map out the origins and historical influences of a few. There are even a few who think that social media sites are good forums of discussion. The poor sods: the Usenet was a better discussion forum than Facebook ever was, and the Usenet's not that great.

If you really want to see what the internet is like (that isn't advice for the author: I'm pretty sure the mall analogy doesn't encompass his internet experience, and is merely an analogue I find odd), explore. See it all, in all of its weird, wacky, zany, jokey, serious, offensive, manic, smart, stupid, brilliant, insane glory. I promise you, you won't be dissapointed.

People ask me why I'm not on social media. It's because social media is boring. Unlike Reddit, 4chan, and the rest, not much interesting happens. Unlike HN, I'm not likely to be intellectually stimulated, or learn something new. Unlike static sites, I don't get to see that kind of wild creativeness that personal webspace tends to invite in hackers, nerds, and others who know what makes the web tick. I don't want to see what you ate, I don't want to see your cat, I don't want to hear banal details about your everyday life. I want to hear something intersting, new, and original. I want to hear the next Ze Frank, or Tom Ridgewell, or Simon Travaglia, or Steve Yegge, or RMS, or PG, or Ryan Dahl, and you can bet I won't on a site with a signal:noise ratio that high.

People also ask why I'm fascinated with the internet. My response is, why wouldn't I be? It's a catalogue of decades of human creativity and interaction. It's open mike night at the largest club in the world, which is also a discussion forum, and a shady back alley, and a convention. It is - to borrow and butcher Sir Terry's words - like being blindfolded and drunk at several different parties at once.

But, in what it rapidly becoming the sign-off on my incoherent, long-winded ramblings that are really only tangentially connected to the topic at hand, maybe I'm just totally mad.

EDIT: tried to clarify that I wasn't trying to insult the author. Not my intent, but it seemed to come off that way. It still does, but less so, and I prefer not to edit my old content too much. Also, I just checked out pinboard. Pinboard is amazing, and I am impressed.

Basically, don't take this as anything more than a tangential, incoherent ramble started by an analogy the author used which I found unrepresentative. Because that's what it is.

joshu 9 years ago

"Needs to out a bit more" given the context is hilarious. I bet a cool $20 he is more widely traveled than you - both physically and digitally.
- qwertyuiop924 9 years ago
  
  Indeed, as would I, on further reflection. I wrote a postscript at the bottom to this effect, more or less.
  I write most of my HN comments in the spur of the moment. As a result, they're often inaccurate, idiosyncratic, poorly explained, or just weird. If anybody asks, I ususally try to clear up any confusion.
  This isn't necessarily a good idea, but if I thought too much before I spoke, beyond a cursory look to see if I'm violating the rules, I'd be to afraid to post anything interesting, or anything at all beyond polite agreement with everybody, which is so very dull, don't you agree?
  2933 votes and countless interesting discussions later, seems to have worked our okay for me.
  - AceJohnny2 9 years ago
    
    Maciej/IdleWords is a bit of a sacred monster around here. He loves to hate on the HN crowd, and he definitely has a dim opinion of the VC/get-rich-quick internet schemes that one could characterize the Valley for. He provides a great reality check for the kind of internet bullshit that flies around a lot.
    He's featured here frequently: https://news.ycombinator.com/from?site=idlewords.com
    On a lighter tone, I highly recommend his "Argentina on Two Steaks a Day" [1] and "The Alameda-Weehawken Burrito Tunnel" [2], each of which had me laughing harder than anything else in my life.
    [1] http://idlewords.com/2006/04/argentina_on_two_steaks_a_day.h...
    [2] http://idlewords.com/2007/04/the_alameda_weehawken_burrito_t...
    
    qwertyuiop924 9 years ago
    
    Ah. That explains a lot. He's definitely well-spoken. And I don't mind people criticizing the VC crowd. I don't people criticizing stuff in general, so long as it's well done.
- fatbird 9 years ago
  
  The current blog entry on the idlewords front page is his trip report on visiting McMurdo Station, Antarctica.
joepie91_ 9 years ago

As somebody who has been involved in the operation of various things in the "shady back alleys" category, I can assure you that they do not have an established place on the internet anymore. It's a constant battle of keeping things online, with pretty much every service provider involved looking for a way to kick you out. You may not see this as an end user, but it is exceedingly difficult to host something that goes outside of the norm.
The idea that the internet is a city might have been true 10 years ago, but it is definitely not true anymore now. The default response to "I need an X" is "just do it on Facebook", and there are entire swaths of content that just don't have a place anymore on the web, "thanks" to ever-increasing enforcement of arbitrary moral guidelines and growing monoculture.
If the internet were to be described as a city, it'd be a gentrified city where most of the artists have long been chased away by ever-increasing rents.

curuinor 9 years ago

Rather hilariously, deep frying is already a term of art in ML, of course in a radically different setting. Deep fried convnets (https://arxiv.org/abs/1412.7149).

argonaut 9 years ago

One (not especially widely read) paper's title can hardly be called a "term of art" in ML.

aub3bhat 9 years ago

Frankly as a grad student (The kind that the author apparently considers "dim witted"), the entire article is meaningless babbling without any underlying theme.

I wonder if the author truly understands "Machine Learning", what are his qualifications? A degree in Art History, and some "programming experience" aren't very assuring. E.g.

>> "The names keep changing—it used to be unsupervised learning, now it’s called big data or deep learning or AI"

WTF?? The author should enroll in a beginner Machine Learning course on Udacity or Coursera before making philosophical statements about fields he has zero clue about.

It seems the only skill the author has is piecing together meaningless arguments that appeals to average HN users incapable of distinguishing between informed opinions and pseudo-scientific rants. Hell at least bad graduate students have to give examinations, read papers and make original contributions that get peer reviewed (otherwise they fail/get-kicked-out/drop-out). Not like this guy who does not understands difference between "supervised" and "unsupervised" machine learning, yet feels comfortable in making "prophetic" statements about machine learning.

Also

>>> "These techniques are effective, but the fact that the same generic approach works across a wide range of domains should make you suspicious about how much insight it's adding."

What does he means by "same generic approach". If we assume he is implying specific algorithms then we have a good "No free lunch" theorem that shows that a single algorithm is not effective across all domains. Now if by "generic approach" the author mean "machine learning" in general then its as ridiculous as saying

"Mathematics is effective, but the fact that the same Mathematical approach works across a wide range of domains should make you suspicious about how much insight it's adding."

The entire article is filled with "truthiness" and "feel-good" statements, which fall apart on closer examination.

idlewords 9 years ago

My degree was in studio art, not art history.
- aub3bhat 9 years ago
  
  Those two are still orders of magnitude closer to each other when compared to difference between unsupervised learning with deep learning.
  - detaro 9 years ago
    
    As someone with not particularly deep knowledge in either area, that admittedly sounds a lot like "no, but the differences between subfields in MY subfield are way more important than all that stuff over there", which is similar to what you claim the article does. They are important once you care about any details, but not for just describing changing fashions.
    So I'm curious to hear a good explanation for that assertion, founded in knowledge of both areas.
bdr 9 years ago

The difference doesn't really matter in context. You're fixating on a small part of the article that isn't important to the main thread.
- aub3bhat 9 years ago
  
  Its not a "small part", its a basic litmus test. The four terms are completely different from each other, and are not names of methods.
  Unsupervised learning: Learning without a set of labels.
  Big Data: Collecting / using large amount of data.
  Deep Learning: Complex, multilayer representations which perform better than shallow/linear representations.
  AI: Artificial Intelligence, an overarching subject or grouping of subjects involved in building intelligent systems.
  Can you imagine someone talking about space exploration while making a statement such as
  >> "The names keep changing—it used to be black holes, now it’s called radio telescope or reusable launch system or Astronomy"
  Thats how ridiculous the original statement is.
  - elmigranto 9 years ago
    
    It's important to understand that this is not a technical talk/article and providing those examples in a sense "there are data, people analyze it, here are some stuff you might've heard" is fine.
    You wouldn't complain that someone mentioned astronomy and music as an example in the talk about education, even though those are quite different disciplines.

Settings

Deep-Fried Data

Keyboard Shortcuts