Keras is the Deep Learning Toolkit You Have Been Waiting For

I remember a time when “learning about computers” invariably started with the phrase “computers only operate on 0s and 1s…” Things could vary a little for a few minutes, but then you’d get to the meat of things: Boolean logic. “All computer programs are formed from these ‘logic gates’…”

I remember a poster that illustrated Boolean logic in terms of punching. A circuit consisted of a bunch of mechanical fists; an “AND” gate propagated the punch when both its inputs were punched, an “OR” required only one input punch, etc. At the bottom were some complex circuits and the ominous question: “Are you going to be punched?” Because Boston. (The answer was “Yes. You are going to be punched.”)

Anyway, the point is that while there was a fundamental truth to what I was being told, it was not overwhelmingly relevant to the opportunities that were blossoming, back then at the dawn of the personal computer revolution. Yes, it’s important to eventually understand gates and circuits and transistors and yes, there’s a truth that “this is all computers do,” but that understanding was not immediately necessary to get cool results, such as endlessly printing “Help, I am caught in a program loop!” or playing Nim or Hammurabi. Those things required simply typing in a page or two of BASIC code.

Transcription being what it is, you’d make mistakes; curiosity being what it is, you’d mess around to see what you could alter to customize the game; and then your ambition would slowly grow, and only then would you start to benefit from understanding the foundations on which you were building.

Which brings us to deep learning.

You have undoubtedly noticed the rising tide of AI-related news involving “deep neural nets.” Speech synthesis, Deep Dream’s hallucinogenic dog-slugs, and, perhaps most impressively, AlphaGo’s success against 9-dan professional Lee Sedol. Unlike robotics and autonomous vehicles and the like, this is purely software-based: this is our territory.

But “learning about deep learning” invariably starts with phrases such as “regression,” “linearly inseparable,” and “gradient descent.” It gets math-y pretty quickly.

Now, just as “it’s all just 0s and 1s” is true but not immediately necessary, “it’s all just weights and transfer functions” is something for which you will _eventually_ want to have an intuition. But the breakthroughs of recent years have not come about so much because of advances at this foundational level as from a dramatic increase in sophistication about how neural networks are “shaped.”

Not long ago, the most common structure for an artificial neural network was an input layer with a number of neural “nodes” equal to the number of inputs, an output layer with a node per output value, and a single intermediate layer. The “deep” in “deep learning” refers to nothing more than networks that have more than a single intermediate layer!
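To make that concrete, here is a minimal sketch in Keras (the library introduced just below); the layer sizes are arbitrary placeholders, and the only structural difference between the “shallow” net and the “deep” one is how many intermediate layers get stacked up:

from keras.models import Sequential
from keras.layers.core import Activation, Dense

# A "shallow" net: one intermediate ("hidden") layer between input and output
shallow = Sequential()
shallow.add(Dense(8, input_dim=4))
shallow.add(Activation('sigmoid'))
shallow.add(Dense(1))
shallow.add(Activation('sigmoid'))

# A "deep" net: the same thing, with more intermediate layers stacked up
deep = Sequential()
deep.add(Dense(8, input_dim=4))
deep.add(Activation('sigmoid'))
deep.add(Dense(8))
deep.add(Activation('sigmoid'))
deep.add(Dense(8))
deep.add(Activation('sigmoid'))
deep.add(Dense(1))
deep.add(Activation('sigmoid'))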

Another major area of advancement is in approaches that are more complex than “an input layer with a node per input.” Recurrence, convolution, attention… all of these terms relate to this idea of the “shape” of the neural net and the manner in which inputs and intermediate terms are handled.
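Recurrence, for example, changes the shape of the input itself: instead of a fixed set of values, the net reads a sequence and carries state from step to step. A rough sketch in the same Keras style (the sequence length and feature count are invented, and the exact layer arguments have shifted a bit between Keras versions):

from keras.models import Sequential
from keras.layers.core import Activation, Dense
from keras.layers.recurrent import LSTM

# A recurrent "shape": each input is a sequence of 20 steps with 8 features per
# step, and the LSTM layer carries internal state from one step to the next
rnn = Sequential()
rnn.add(LSTM(32, input_shape=(20, 8)))
rnn.add(Dense(1))
rnn.add(Activation('sigmoid'))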

… snip descent into rabbit-hole …

The Keras library allows you to work at this higher level of abstraction, while running on top of either Theano or TensorFlow, lower-level libraries that provide high-performance implementations of the math-y stuff. This is a Keras description of a neural network that can solve the XOR logic gate. (“You will get punched if one, but not both, of the input fists gets punched.”)

import numpy as np
from keras.models import Sequential
from keras.layers.core import Activation, Dense
from keras.optimizers import SGD

# The four rows of the XOR truth table: inputs in X, expected outputs in y
X = np.zeros((4, 2), dtype='uint8')
y = np.zeros(4, dtype='uint8')

X[0] = [0, 0]
y[0] = 0
X[1] = [0, 1]
y[1] = 1
X[2] = [1, 0]
y[2] = 1
X[3] = [1, 1]
y[3] = 0

# A 2-2-1 network: two inputs, a single hidden layer of two nodes, one output
model = Sequential()
model.add(Dense(2, input_dim=2))
model.add(Activation('sigmoid'))
model.add(Dense(1))
model.add(Activation('sigmoid'))

# Stochastic gradient descent with momentum does the actual learning
sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd, class_mode="binary")

history = model.fit(X, y, nb_epoch=10000, batch_size=4, show_accuracy=True, verbose=0)

print(model.predict(X))

I’m not claiming that this should be crystal clear to a newcomer, but I do contend that it’s pretty dang approachable. If you wanted to produce a different logic gate, you could certainly figure out which lines to change. If someone told you “The ReLU activation function is used more often than sigmoid nowadays,” your most likely ‘let me see if this works’ edit would, in fact, work (as long as you guessed you should stick with lowercase).
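That guess amounts to a one-word change to the hidden layer (a sketch; I’ve left the output layer’s sigmoid alone so the prediction stays between 0 and 1):

model = Sequential()
model.add(Dense(2, input_dim=2))
model.add(Activation('relu'))      # was 'sigmoid'
model.add(Dense(1))
model.add(Activation('sigmoid'))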

For historical reasons, solving XOR is pretty much the “Hello, World!” of neural nets. It can be done with relatively little code in any neural network library and can be done in a few dozen lines of mainstream programming languages (my first published article was a neural network in about 100 lines of C++. That was… a long time ago…).

But Keras is not at all restricted to toy problems. Not at all. Check this out. Or this. Keras provides the appropriate abstraction level for everything from introductory to research-level explorations.

Now, is it necessary for workaday developers to become familiar with deep learning? I think the honest answer to that is “not yet.” There’s still a very large gap between “what neural nets do well” and “what use cases the average developer is being asked to address.”

But I think that may change in a surprisingly short amount of time. In broad terms, what artificial neural nets do is recognize patterns in noisy signals. If you have a super-clean signal, traditional programming with those binary gates works. More importantly, lots of problems don’t seem easily cast into “recognizing a pattern in a signal.” But part of what’s happening in the field of deep learning is very rapid development of techniques and patterns for re-casting problems in just this way. So-called “sequence-to-sequence” problems such as language translation are beginning to rapidly fall to the surprisingly effective techniques of deep learning.
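To give a flavor of that re-casting, here is a heavily simplified sketch of a sequence-to-sequence shape, again in Keras (all of the sizes are invented, the layer names assume a Keras 1.x-style API, and a real translation model would be considerably more elaborate): an encoder reads the input sequence, its fixed-size summary is handed to a decoder once per output step, and the decoder emits the output sequence.

from keras.models import Sequential
from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

INPUT_LEN, OUTPUT_LEN, IN_VOCAB, OUT_VOCAB = 12, 10, 40, 40  # invented sizes

seq2seq = Sequential()
seq2seq.add(LSTM(128, input_shape=(INPUT_LEN, IN_VOCAB)))    # encoder: read the whole input sequence
seq2seq.add(RepeatVector(OUTPUT_LEN))                        # hand its summary to each output step
seq2seq.add(LSTM(128, return_sequences=True))                # decoder: produce a sequence
seq2seq.add(TimeDistributed(Dense(OUT_VOCAB, activation='softmax')))
seq2seq.compile(loss='categorical_crossentropy', optimizer='adam')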

… snip descent into rabbit-hole …

Lots of problems and sub-problems can be described in terms of “sequence-to-sequence.” The synergy between memory, attention, and sequence-to-sequence — all areas of rapid advancement — is tipping-point stuff. This is the stuff of which symbolic processing is made. When that happens, we’re talking about real “artificial intelligence.” Artificial intelligence, yes, but not, I think, human-level cognition. I strongly suspect that human-level, general-purpose AI will have a trajectory similar to that of medicine based on genetics: too complex, messy, and tangled to be cracked with a single breakthrough.

The Half-Baked Neural Net APIs of iOS 10

iOS 10 contains two sets of APIs relating to Artificial Neural Nets and Deep Learning, aka The New New Thing. Unfortunately, both APIs are bizarrely incomplete: they allow you to specify the topology of the neural net, but they have no facility for training.

I say this is “bizarre” for two reasons:

  • Topology and the results of training are inextricably linked; and
  • Topology is static

The training of a neural net is, ultimately, just setting the weighting factors for the elements in the network topology: for every connection in the network, you have some weighting factor. A network topology without weights is useless. A training process results in weights for that specific topology.

Topologies are static: neural nets do not modify their topologies at runtime. (Topologies are not generally modified even during training: instead, the experimenter uses their intuition to create a topology that they then train.) The topology of a neural net ought to be declarative and probably ought to be loaded from a configuration file, along with the weights that result from training.
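For comparison, this is roughly the shape of things in Keras (from the earlier example): the topology is persisted as a declarative JSON description and the trained weights as a separate resource, and deployment is just re-loading the two. A sketch using the XOR model from above, with made-up file names:

from keras.models import model_from_json

# Training side: persist the topology declaratively, the weights separately
open('xor_topology.json', 'w').write(model.to_json())  # architecture only, no weights
model.save_weights('xor_weights.h5')                   # just the trained weighting factors

# Deployment side: rebuild the same topology, then pour the weights back in
restored = model_from_json(open('xor_topology.json').read())
restored.load_weights('xor_weights.h5')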

When I first saw the iOS 10 APIs, I thought it was possible that Apple was going to reveal a high-level tool for defining and training ANNs: something like Quartz Composer, but for neural networks. Or perhaps some kind of iCloud-based service for doing the training. But instead, in the sessions at WWDC they said that the model was to develop and train your networks in something like Theano and then use the APIs to run the results on the device.

This is how it works:

  • Do all of your development using some set of tools not from Apple, but make sure that your results are restricted to the runtime capabilities of the Apple neural APIs.
  • When you’re done, you’ll have two things: a network graph and weights for each connection in that graph (exporting those weights is sketched just after this list).
  • In your code, use the Apple neural APIs to recreate the network graph.
  • Ship the weights as a resource (downloaded or loaded from a file).
  • Back in your code, stitch together the weights and the graph. One mistake and you’re toast. If you discover a new, more efficient, topology, you’ll have to change your binary.
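On the training side, the export half of that recipe might look something like this (a sketch that assumes Keras as the non-Apple toolchain, a trained model named model, and invented file names); the iOS code would then read these arrays back in and assign them, connection by connection, to the graph it rebuilt with Apple’s APIs:

import numpy as np

# Dump every layer's parameters (weight matrices and bias vectors) so they can be
# shipped as resources and stitched back onto the hand-rebuilt graph on the device
for i, layer in enumerate(model.layers):
    for j, param in enumerate(layer.get_weights()):
        np.save('layer_%d_param_%d.npy' % (i, j), param.astype(np.float32))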

This is my prediction: Anyone who uses these APIs is going to instantly write a higher-level API that combines the definition of the topology with the setting of the weights. I mean: Duh.

Now, to be fair, you could implement your own training algorithm on the device and modify the weights of a pre-existing neural network based on device-specific results. Which makes sense if you’re Apple and want to do as much of the Siri / Image recognition / Voice recognition heavy lifting on the device as possible but allow for a certain amount of runtime flexibility. That is, you do the vast majority of the training during development, download the very complex topology and weight resources, but allow the device to modify the weights by a few percent. But even in that case, either your topology stays static or you build it based on a declarative configuration file, which means that whichever route you choose, you’re still talking about a half-baked API.

Bizarre.

An Agile Thought Experiment

A team, unaccustomed to but enthusiastic about moving towards agile methodologies, begins an important project. The project has many facets and a number of strong developers, so it seems natural for each developer to concentrate on a single facet: Adam is associated with the Widget feature, Barbara with the Sprocket, Charlie with the Doohickey, etc. Charlie is the most familiar with the domain, the legacy codebase upon which the project is being built, etc.

The project is scheduled to last 36 weeks, divided into 12 3-week sprints. At the end of Sprint 8, Peter the Product Owner is satisfied with the feature set of the Sprocket, Doohickey, and other facets, but it has turned out that the Widget feature has been more complex and is clearly going to be the focus of the remaining sprints. Further, Peter has a number of additional features that he’d like to see in the finished project, if possible.

Charlie says “Well, I can add these new features as requested by Peter, but not using Adam’s code, which doesn’t capture several important domain concepts needed to rapidly develop them.” Adam says “These new features are outside the domain I’ve worked in. I think my code is fine, but I cannot guarantee where it will be in the few remaining sprints.” Adam’s code is shelved, the team realigns into a more traditional lead programmer structure with Charlie “doing the hard parts.” At the end of Sprint 12, the project moves across the finish line with an acceptable level of quality.

Would you say that this project was a success because “It delivered customer value on time; it discovered a problem and course-corrected, the switch into a non-agile mode for a short period is fine, like a 2-minute drill in football.”? Or would you say that the project was a failure because “It relied on ‘superhero’ efforts from Charlie at the last minute; it didn’t identify that the Widget feature was not coming together properly and 24 weeks of Adam’s efforts did not go into production.”?

What improvements to methodology and team structure could be made for future projects? Should the team structure themselves more along the lines of a Lead Programmer model (Charlie is clearly the most productive developer) or less (the argument being that the feature-focused structure distributes credit and blame unfairly)?

Advice From An Old Programmer

Advice From An Old Programmer — Learn Python The Hard Way, 2nd Edition.

Programming as a profession….can be a good job, but you could make about the same money and be happier running a fast food joint. You’re much better off using code as your secret weapon in another profession.

People who can code in the world of technology companies are a dime a dozen and get no respect. People who can code in biology, medicine, government, sociology, physics, history, and mathematics are respected and can do amazing things to advance those disciplines.

True that.

MS Concurrency Guru Speaks of “new operating system”

If you are interested in high-performance programming on Windows, you know the name Joe Duffy, whose book Concurrent Programming On Windows is absolutely top-notch.

Today he posted an intriguing notice on his blog “We are hiring.” Check out some of the things he says:

My team’s responsibility spans multiple aspects of a new operating system’s programming model…. When I say languages, I mean type systems, mostly-functional programming, verified safe concurrency, and both front- and back-end compilation…. All of these components are new and built from the ground up.

Huh. I’ve argued before that the manycore era requires a fundamental break in OS evolution. Every aspect of the machine has to be rethought; the fundamental metaphor of a computer as a von Neumann machine with, perhaps, a phone line to the outside world has been strained to the breaking point. Forget “the cloud,” we need to think about “the fog” — a computing system where every resource (including resources outside the box at which you happen to be typing) can be accessed concurrently, securely, and virtually.

I don’t think that the OS for the manycore era can evolve from any existing desktop OS. That’s why I think that the “Windows 7 vs. OS X vs. Linux” debates are short-sighted and even the “Windows vs. iOS vs. Android” debates are only skirmishes to determine who has the money, mindshare, and power to eventually win the real battle.

It needs to be said that Microsoft has lots of incubation and research projects whose results either are left to wither or are watered-down and incorporated into mainstream products. But the involvement of a top non-academic thought-leader makes me hopeful that Duffy’s project may have a bright future.

IronPython 2.0 & Microsoft Research Infer.NET 2.2

import sys
import clr
sys.path.append("c:\\program files\\Microsoft Research\\Infer.NET 2.2\\bin\\debug")
clr.AddReferenceToFile("Infer.Compiler.dll")
clr.AddReferenceToFile("Infer.Runtime.dll")
from MicrosoftResearch.Infer import *
from MicrosoftResearch.Infer.Models import *
from MicrosoftResearch.Infer.Distributions import *

firstCoin = Variable[bool].Bernoulli(0.5)
secondCoin = Variable[bool].Bernoulli(0.5)
bothHeads = firstCoin & secondCoin
ie = InferenceEngine()
print ie.Infer(bothHeads)

-->

c:\Users\Larry O'Brien\Documents\Infer.NET 2.2>ipy InferNetTest1.py
Compiling model...done.
Initialising...done.
Iterating: .........|.........|.........|.........|.........| 50
Bernoulli(0.25)

Sweet

Fast Ranking Algorithm: Astonishing Paper by Raykar, Duraiswami, and Krishnapuram

The July 2008 issue (Vol. 30, No. 7) of IEEE Transactions on Pattern Analysis and Machine Intelligence has an incredible paper by Raykar, Duraiswami, and Krishnapuram. “A Fast Algorithm for Learning a Ranking Function from Large-Scale Data Sets” appears to be a game-changer for an incredibly important problem in machine learning. Basically, they use a “fast multipole method” developed for computational physics to rapidly estimate (to arbitrary precision) the conjugate gradient of an error function. (In other words, they tweak the parameters and “get a little better” the next time through the training data.)

The precise calculation of the conjugate gradient is O(m^2). This estimation algorithm is O(m)! (That’s an exclamation point, not a factorial!)
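To see where the m^2 comes from, consider a generic pairwise ranking loss (this is an illustration, not the paper’s formulation): the exact gradient sums a term over every ordered pair of training examples, so doubling the data quadruples the work. The paper’s claim is that those pairwise sums can be approximated to arbitrary precision in linear time.

import numpy as np

def exact_pairwise_gradient(scores, labels):
    # Exact gradient of a RankNet-style pairwise loss with respect to the scores.
    # Every pair (i, j) in which item i should outrank item j contributes a term,
    # so the nested loop makes this O(m^2) in the number of training examples m.
    m = len(scores)
    grad = np.zeros(m)
    for i in range(m):
        for j in range(m):
            if labels[i] > labels[j]:
                g = -1.0 / (1.0 + np.exp(scores[i] - scores[j]))  # d/ds_i of log(1 + exp(-(s_i - s_j)))
                grad[i] += g
                grad[j] -= g
    return grad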

On a first reading, I don’t grok how the crucial transform necessarily moves towards an error minimum, but the algorithm looks (surprisingly) easy to implement and their benchmark results are jaw-dropping. Of course, others will have to implement it and analyze it for applicability across different types of data sets, but this is one of the most impressive algorithmic claims I’ve seen in years.

Once upon a time, I had the great fortune to write a column for a magazine on artificial intelligence and could justify spending huge amounts of time implementing AI algorithms (well, I think I was paid $450 per month for my column, so I’m not really sure that “justified” 80 hours of programming, but I was young). Man, would I love to see how this algorithm works for training a neural network…

30K application lines + 110K testing lines: Evidence of…?

I recently wrote an encomium to ResolverOne, the IronPython-based spreadsheet:

[T]heir use of pair programming and test-driven development has delivered high productivity; of the 140,000 lines of code, 110,000 are tests….ResolverOne has been in development for roughly two years, is written in a language without explicit type declarations, and is on an implementation that itself is in active development. It’s been brought to beta in a credible (if not downright impressive) amount of time despite being developed by pairs of programmers writing far more lines of test than application. Yet no one can credibly dismiss the complexity of 30,000 lines of application logic or spreadsheet functionality, much less the truly innovative spreadsheet-program features.

ResolverOne is easily the most compelling data point I’ve heard for the practices of Extreme Programming.

[Extreme Program, SD Times]

Allen Holub sees the glass as half-empty, writing:

I want to take exception to the notion that Python is adequate for a real programming project. The fact that 30K lines of code took 110K lines of tests is a real indictment of the language. My guess is that a significant portion of those tests are addressing potential errors that the compiler would have found in C# or Java. Moreover, all of those unnecessary tests take a lot of time to write, time that could have been spent working on the application.

I was taken aback by this, perhaps because it’s been a good while since I’ve heard someone characterize tests as evidence of trouble as opposed to evidence of quality.

There are (at least) two ways of looking at tests:

  1. Tools for discovering errors, or
  2. Quality gates (they’re one way — are they quality diodes?)

There’s no doubt that the software development tradition has favored the former view (once you’ve typed a line, everything you do next is “debugging”). However, the past decade has seen a … wait for it … paradigm shift.

The Agile Paradigm views change over time as a central issue; if it were still the 90s, I would undoubtedly refer to it as Change-Oriented Programming (COP). Tests are the measure of change — not lines of code, not cyclomatic complexity, not object hierarchies, not even deployments.

(Perhaps “User stories” or scenarios are the “yard-stick” of change, tests are the “inch-stick” of change, and deployments are the “milestone” of change.)

So from within the Agile Paradigm / COP, a new test is written that fails, some new code is written, the test passes — a one-way gate has been passed through, progress has been made, and credit accrues. From outside the paradigm, a test is seen as indicative of a problem that ought not to exist in the first place. The passing of the test is not seen as the salient point, the “need” for (i.e., existence of) the test is seen as evidence of low quality.

In true test-driven development, every test fails at least once, because the tests are written before the code. What is perhaps not appreciated by those outside the Agile Paradigm, however, is that tests are written that one expects to pass from the moment the relevant code is created. For instance, if one had fields for sub-total, taxes, and total, one would certainly write a test that confirmed that total = sub-total + taxes. One would also certainly expect that test to pass as soon as the code had been written.
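To make the sub-total example concrete, such a test might look like this (a hypothetical Invoice class, sketched in Python):

import unittest

class Invoice(object):
    def __init__(self, sub_total, taxes):
        self.sub_total = sub_total
        self.taxes = taxes
        self.total = sub_total + taxes

class InvoiceTest(unittest.TestCase):
    def test_total_is_sub_total_plus_taxes(self):
        # Written before (or alongside) the code, but expected to pass from the
        # moment Invoice exists: a quality gate, not a bug hunt
        invoice = Invoice(sub_total=100.00, taxes=8.25)
        self.assertEqual(invoice.total, 108.25)

if __name__ == '__main__':
    unittest.main()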

As is often the case with paradigms, just realizing that there are different mental models / worldviews in play is crucial to communication.

Update: This relates to Martin Fowler’s recent post on Schools of Software Development.

Microsoft’s StartKey: Computer Environment on a USB Stick. I’ve Experienced This Before and It’s Awesome

StartKey will be a technology that allows you to carry your Windows logon around on a USB keychain. Early reaction is mixed as to the value of this, but I loved something similar when I worked for a company developing software for Sun JavaStation network computers.

With JavaStations, you had a smartcard that you plugged in and, after 10 seconds or so, up would come your desktop. Since most of the time you work at your desk, most of the time this was not particularly valuable. But let me tell you — it was fantastic for meetings and presentations. No messing around with cables and display settings, no hand-waving when you happened to be discussing an issue on the other side of the office.

The difference is that the JavaStations were uniform hardware, too, and all your software lived on the server (which, it turned out at 7AM the morning of a major trade show, is a single point of failure). While you might have a good experience assuming that a random machine has Office on it (a smile creeps across Microsoft’s face), there would presumably have to be a solution for specialist software such as Visual Studio or Photoshop that could not be assumed to be local.

I would think the problem with that is that although memory sticks are probably getting capacious enough, the bus connection between the memory stick and the host computer is going to be a bottleneck.