An Intuitive Explanation of Convolutional Neural Networks (2016)
ujjwalkarn.me

How do CNNs work when the output is multiple categories? For instance, suppose the same image contains a cat and a dog and a car. What's the architecture look like - multiple CNNs, each that can predict one category? Or does one CNN have multiple outputs, and if the score > threshold, add that category to the list shown to the user?
Also, how do CNNs draw a box around the target in the image?
First question: The network is trained to recognize a fixed set of outputs. That's what makes it a classifier -- it classifies an input into a single output. It does this by giving each output possibility a score, and the highest score is its guess for what the original image is. So if I have a network that I train to recognize cats, dogs, and cars, and I get an output like {cat: .13, dog: .85, car: .02}, then the input was most likely a dog. The network calculates all of those values simultaneously.
You can, of course, tell the network to output whatever you want: all of the guesses, best guess, top five guesses, all guesses over a threshold, etc.
Note, this is a gross oversimplification, but it gets the general concept across.
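To make the "best guess / top five / over a threshold" options concrete, here's a minimal sketch (not from the article; the class names and scores are made up for illustration) of how you might post-process the scores the network produces:

    # Illustrative class scores, as in the {cat, dog, car} example above
    scores = {"cat": 0.13, "dog": 0.85, "car": 0.02}

    # Best single guess
    best = max(scores, key=scores.get)                         # "dog"

    # Top-k guesses, highest score first
    top_k = sorted(scores, key=scores.get, reverse=True)[:2]   # ["dog", "cat"]

    # All guesses over a threshold
    threshold = 0.10
    over_threshold = [label for label, s in scores.items() if s > threshold]

    print(best, top_k, over_threshold)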
The first question has been widely answered, so let me jump to the second question -- bounding boxes.
In the past, some people did this inefficiently by just sliding a window across the image and running the same classifier you'd use for the first problem on each window. But this is inefficient, and handling multiple window sizes makes it even more so. The better solution is to use an "Object Detection" network; look into SSD or YOLO for examples of this.
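To see why the naive approach is so expensive, here is a rough sketch of the sliding-window idea (my own illustration, not the article's; classify_patch is a stand-in for whatever single-image classifier you already have):

    import numpy as np

    def sliding_window_detect(image, classify_patch, window=64, stride=16, score_threshold=0.9):
        """Naive detection: run the ordinary classifier at every window position.
        classify_patch(patch) is assumed to return (label, score)."""
        boxes = []
        h, w = image.shape[:2]
        for y in range(0, h - window + 1, stride):
            for x in range(0, w - window + 1, stride):
                patch = image[y:y + window, x:x + window]
                label, score = classify_patch(patch)
                if score > score_threshold:
                    boxes.append((x, y, window, window, label, score))
        # Every additional window size multiplies the work again, which is why
        # detectors like SSD and YOLO predict boxes directly in one pass instead.
        return boxes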
What's the architecture look like - multiple CNNs, each that can predict one category?
Effectively, yes. That's what a DNN is: an ANN with multiple inference layers. Each one gives its "highest probability", and then the client/system sets the threshold for what is returned.
> Parameters like number of filters, filter sizes, architecture of the network etc. have all been fixed before Step 1 and do not change during training process – only the values of the filter matrix and connection weights get updated.
Is this just the article's over-simplification or are these values really just randomly selected?
These are called hyperparameters (number of filters, filter sizes, stride length, pooling function, activation function, and a whole host of others not mentioned in this article). They are chosen "randomly" in the sense that it isn't an exact science, i.e., there is no "right" answer. However, intuition and experience are used as a guide to select reasonable values.
The values in the filter matrices and the weights and biases of the fully connected layers are truly random though. They are often initialized with Gaussian random values. Sometimes they are just initialized as all 1's, or 0's. Again, there's no "right" answer (there is probably research out there that recommends one initialization approach over another). These are the values that are trained using gradient descent.
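A small sketch of that split, assuming Keras/TensorFlow (the specific numbers here are arbitrary): the filter count, filter size, stride, and activation are hyperparameters fixed when the layer is built, while the filter values themselves start as random numbers (Gaussian in this case) and are the part that gradient descent updates.

    import tensorflow as tf

    layer = tf.keras.layers.Conv2D(
        filters=32,            # hyperparameter: chosen by the designer, never trained
        kernel_size=(5, 5),    # hyperparameter
        strides=(1, 1),        # hyperparameter
        activation="relu",     # hyperparameter
        kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.05),  # random starting values
    )

    layer.build(input_shape=(None, 28, 28, 1))
    weights, biases = layer.get_weights()   # these are what training updates
    print(weights.shape)                    # (5, 5, 1, 32)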
I actually don't think this is a good explanation at all. I'm not saying it's badly written, just that it's not a good explanation for the stated purpose (serving as an intuitive explanation).
On that point, the article is certainly NOT intuitive if you don't already understand image convolution. The explanation is also very long and rambling. While I understand the author has made an effort, I don't think the article really presents the subject matter in a new way: I can learn all of this elsewhere. This is a common problem when people write about complex subject matter without fully understanding the knowledge gap between teacher and audience.
If I were the author, I might try to read up on technical communication and spend some time figuring out how to correctly simplify something. As it stands, this article uses the typical strategy of information hiding to simplify the subject matter. The problem is that information hiding doesn't work very well unless it is expertly done. I do like the animation, but again, it only serves to show how image convolution works, and doesn't actually teach us anything about a CNN.
I would suggest the author break the document into three separate sections, the first being very simple (maybe start with the part that says 'images are just matrices') and then add more details in each section. The final section would have a lot of detail. That way you counteract the information blindness that occurs from simplification by providing the information later.
Otherwise, this article is really more of a data dump than an intuitive explanation, and since it doesn't really teach us anything we can't learn elsewhere, I don't see what it contributes.
A cleaner explanation, expertly prepared, could really elevate the effort that went into this.
Jesus, chill. I am reading it (and I know nothing about CNNs), and learning what I need to read first about them. The author makes it clear that there are a few things you need to read beforehand.
I think it is a good article/blog post, (thanks dude, whoever you are that wrote it).
You on the other hand didn't give any better alternatives on your "rant".
The comments are for discussing HN submissions. As written, the article is yet another data dump on CNNs. There are a lot of these on the web already, and I don't think this explanation is better than what already exists.
I stand by my comments.
> You on the other hand didn't give any better alternatives on your "rant".
I don't have to provide better alternatives. Note that my response did provide suggestions on how to improve the article.
It's immensely clear...
man i came back after 2 days just to comment on this: you're high. the article is crystal clear and intuitive. he covers each layer's design, purpose, and effect in intuitive terms.
you on the other hand haven't said literally anything except vague criticism. look, i'll show you how it's done:
>The explanation is also very long and rambling. While I understand the author has made an effort, I don't think the article really presents the subject matter in a new way: I can learn all of this elsewhere. This is a common problem when people write about complex subject matter without fully understanding the knowledge gap between teacher and audience.
these two sentences have nothing to do with each other: that the explanation isn't novel has nothing to do with elided gaps between expositor and reader (where usually the problem is that the exposition is too complex, not too simple, as you've confused it).
>If I were the author, I might try to read up on technical communication and spend some time figuring out how to correctly simply something.
vague. read from where? which chapters? simplify which parts?
>I do like the animation, but again, it only serves to show how image convolution works, and doesn't actually teach us anything about a CNN.
it's like you think that one animation should explain the entire CNN. did you actually read the post? that image explains convolution and is the absolute standard illustration of convolving with a filter/kernel.
>I would suggest the author break the document into three separate sections
better in that at least it's concrete advice. i suggest you include more points like this.
>images are just matrices
are you suggesting the author goes into CCDs? ADCs? now that would be a rambling post.
>That way you counteract the information blindness that occurs from simplification by providing the information later.
that's terrible advice. detail should be evenly distributed through the article. look at any journal article: except for the appendices all of the meat is in the body not in the conclusion.
>Otherwise, this article is really more of a data dump than an intuitive explanation,
a data dump would be just code. this is in fact an intuitive explanation that uses the classification of dogs/cats/boats/birds as the framework, so there's a structure, terms are defined, there's context (lenet etc.), and there are references.
>and since it doesn't really teach us anything we can't learn elsewhere, I don't see what it contributes.
blog articles don't need to be novel.
The article is all right, but newbies reading it should be a little careful. The author is sloppy with terminology in a way that can trip up someone who is just learning; for example, a kernel and a filter are not the same thing.
Can you explain the difference between the two? I'm new to CNNs and have been wondering this myself - this SO answer says that they're the same thing:
https://stats.stackexchange.com/questions/154798/difference-...
Sorry, edited a bit for clarity after thinking some more about it.
The kernel of a filter would be its impulse response, which is what you convolve by to get the filter response. That's where the sloppy terminology comes from. A kernel, though, does not need to be a filter.
A kernel is a function whose product maps a point in one domain onto another domain. For example, the Fourier transform has a kernel of e^(jωt). The integral (or sum, if discrete) of these products over the function is the transform, because it maps the entire function into its new space. A filter is a function typically defined by product behavior in the frequency (transformed) domain, which is equivalent to convolution in the time (original) domain. A window is a function that has product behavior in the time (original) domain, and thus convolution behavior in the frequency (transformed) domain.
Particularly in linear algebra (matrix math), if something is a kernel function, there are certain mathematical implications.
Another confusing bit here is that the convolution they are performing to project the original function (the larger image matrix) onto the smaller one isn't a proper convolution; there is a hidden window function in the way the operation is performed, which restricts the output to only the fully overlapped area of an otherwise linear 2D convolution. This is typically called a cropped convolution in image processing.
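You can see the cropping directly with scipy (my own illustration, not from the article): the "valid" mode keeps only the positions where the kernel fully overlaps the image, which is what CNN layers compute, while "full" is the proper linear convolution.

    import numpy as np
    from scipy.signal import convolve2d

    image = np.arange(25, dtype=float).reshape(5, 5)
    kernel = np.ones((3, 3)) / 9.0          # a simple averaging kernel

    full = convolve2d(image, kernel, mode="full")    # proper linear convolution
    valid = convolve2d(image, kernel, mode="valid")  # cropped to full overlap

    print(full.shape, valid.shape)   # (7, 7) (3, 3)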
Personally I think you're being a bit pedantic and fuzzy yourself on the terminology. For the purposes of CNNs, it's perfectly fine to think of them as the same thing, and the kernel in this case is simply not the same as the "kernel" in linear algebra you alluded to. In fact, it's so different, I don't even know why you'd bother to mention it.
anyone happen to be familiar with any uses of CNNs on 1D "images"? (like you'd get from linear image sensors https://toshiba.semicon-storage.com/ap-en/product/sensor/lin... )
i hit up google scholar occasionally looking for references, but literally everything seems to be applying them to 2D images.
Well, what happens if you build a 1D-input CNN in TensorFlow and train it the usual way? Does it work? Seems like it should.
What's even the difference between 1D inputs and 2D inputs? It's all a bunch of numbers anyway. I don't think it really matters if the pixels are arranged (as you see them) in a neat rectangle vs in a straight line. You could take a 2D matrix and enumerate it as a linear string of numbers and it would still be the same matrix, just represented differently. I don't think the CNN cares either way.
I would go as far as saying that the 1D-ness of the input is just "in your head".
I would argue that in a signal (1D) you can expect some sort of relationship between consecutive elements. In an image (in essence a 2D signal), you can expect a relationship between consecutive elements not just on the horizontal, but also on the vertical axis.
If you arbitrarily represent a signal as a 2D matrix, then abrupt changes in the gradient on the vertical axis are meaningless. But the same is not true in an image, which is naturally represented as a 2D matrix. Here, a sudden change on the vertical axis usually corresponds to an edge in the image.
If you represent an image as a 1D array, you throw away spatial information. So I'm not sure about the 1D-ness just being in ones head.
1D signals are typically associated with time series data. There's a ton of work on audio signals, e.g., speech recognition.
edit: I'm not sure if you're asking specifically for examples of CNNs applied to linear image sensor data, or if you're asking whether CNNs have been applied to any 1D input data.
Same as in 2D, but the convolution kernel is a vector. It's similar to the case of CNNs applied to NLP, where the input text is also 1D.
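A minimal sketch of the 1D case, assuming Keras/TensorFlow (the input length and class count are made up; a linear sensor line would just be the 1024-sample axis here). The only real change from the 2D case is that Conv1D slides a vector-shaped kernel along a single axis:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1024, 1)),           # 1024 samples, 1 channel
        tf.keras.layers.Conv1D(16, kernel_size=5, activation="relu"),
        tf.keras.layers.MaxPooling1D(pool_size=2),
        tf.keras.layers.Conv1D(32, kernel_size=5, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(10, activation="softmax"),   # 10 classes, arbitrary
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.summary()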
This article was very helpful. The animations did wonders to show how the networks iterate.
The computerphile cnn video is quite good.
Of course, Andrej Karpathy's Stanford lecture on the subject is as well.
Breezes right over back-propagation, arguably the most crucial part :/