Please: Say “Twin” Networks, Not “Siamese” Networks

I’m a big fan of using a pair of identical networks to create output vectors that are then fed into a network trained to judge whether its two input vectors are the same or different.

This is a powerful technique for “low k-shot” comparisons. Most ML techniques that are trying to identify, say, “photos of _my_ cat,” require lots of examples of both the general category and perhaps dozens or hundreds of photos of my specific cat. But with twin networks, you train the discriminator portion to tell whether one input is from the same source as the other. Those two inputs are generated from two sub-networks that share the same weights.

Schematic of twin network

A twin network for low k-shot identification

Since training propagates backwards, for the discriminator to succeed, it needs “features that distinguish a particular cat.” And since the weights that generate those inputs are shared, training (if it’s successful) creates a “Features” sub-network that essentially extracts a “fingerprint useful for discriminating.”

To run inference with a twin network, then, you hold your target data constant in one network and iterate over your exemplar dataset. For each exemplar, you get a “similarity to target” rating that you can use for final processing (perhaps with a “That looks like a new cat!” threshold or perhaps with a user-reviewed “Compare the target with these three similar cats” UX, etc.).
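Here’s a minimal sketch of that inference loop in plain Python. Everything in it is a stand-in: the one-layer `features` function plays the part of the trained “Features” sub-network, cosine similarity plays the part of the trained discriminator, and the weights, exemplar names, and threshold are made up for illustration.

```python
import math

def features(x, weights):
    """Shared "Features" sub-network: a single tanh layer.
    Both twins call this with the SAME weights."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def similarity(a, b):
    """Stand-in for the trained discriminator: cosine similarity."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0

def rank_exemplars(target, exemplars, weights):
    """Hold the target constant and score every exemplar against it."""
    target_fp = features(target, weights)
    scored = [(name, similarity(target_fp, features(x, weights)))
              for name, x in exemplars.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical weights and data, purely for illustration
WEIGHTS = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
EXEMPLARS = {
    "cat_a": [1.0, 0.0, 0.5],
    "cat_b": [-1.0, 0.2, 0.0],
    "cat_c": [0.9, 0.1, 0.4],
}
NEW_CAT_THRESHOLD = 0.9  # assumed cutoff for "That looks like a new cat!"

ranking = rank_exemplars([1.0, 0.0, 0.5], EXEMPLARS, WEIGHTS)
best_name, best_score = ranking[0]
looks_like_new_cat = best_score < NEW_CAT_THRESHOLD
for name, score in ranking:
    print(name, round(score, 3))
```

In a real twin network both functions would be learned together; the point here is only the shape of the loop: one fingerprint for the target, one per exemplar, and a score for each pair.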

As I said, I’m a big fan of this technique and it’s what I’ve been using in my whaleshark identification project.

However, there’s one unfortunate thing about this technique, which is that it was labeled as a “Siamese network” back in the day. This is a reference to the term “Siamese twins,” which is an archaic and potentially offensive way to refer to conjoined twins.

It would be a shame if this powerful technique grew in popularity and carried with it an unfortunate label. “Twin networks” is just as descriptive and not problematic.

Posted in AI

The Simplest Deep Learning Program That Could Possibly Work

Once upon a time, when I, a C programmer, first learned Smalltalk, I remember lamenting to J.D. Hildebrand “I just don’t get it: where’s the main()?” Eventually I figured it out, but the lesson remained: Sometimes when learning a new paradigm, what you need isn’t a huge tutorial, it’s the simplest thing possible.

With that in mind, here is the simplest Keras neural net that does something “hard” (learning and solving XOR):

import numpy as np
from keras.models import Sequential
from keras.layers import Activation, Dense
from keras.optimizers import SGD

# Allocate the input and output arrays
X = np.zeros((4, 2), dtype='uint8')
y = np.zeros(4, dtype='uint8')

# Training data X[i] -> Y[i]
X[0] = [0, 0]
y[0] = 0
X[1] = [0, 1]
y[1] = 1
X[2] = [1, 0]
y[2] = 1
X[3] = [1, 1]
y[3] = 0

# Create a 2 (inputs) : 2 (middle) : 1 (output) model, with sigmoid activation
model = Sequential()
model.add(Dense(2, input_dim=2))
model.add(Activation('sigmoid'))
model.add(Dense(1))
model.add(Activation('sigmoid'))

# Train using stochastic gradient descent
sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)

# Run through the data `epochs` times
history = model.fit(X, y, epochs=10000, batch_size=4, verbose=0)

# Test the result (uses same X as used for training)
print (model.predict(X))

If you run this, there will be a startup time of several seconds while the libraries load and the model is built, and then the call to fit will run silently (verbose=0 suppresses its progress output). After the data has been run through 10,000 times, the model will then try to predict the output. As you’ll see, the neural network has learned the proper set of weights to solve the XOR logic gate.

Now draw the rest of the owl.


Deep Whalesharks

Whalesharks are the largest fish in the ocean, but very little is known about their movements (where they breed, for instance, has been a huge mystery, although there’s now pretty good evidence that some, at least, breed in the Galapagos).

Whalesharks have a “fingerprint” in the form of distinct spots on their front half. The current state-of-the-art technique for ID’ing whalesharks from photos is a pretty brilliant appropriation of an algorithm for locating astrophotographs in the sky:

  1. Extract Points Of Interest (POIs) from your target image
  2. Draw the mesh of triangles created from those points
  3. Create a histogram of the interior angles of those triangles
  4. Use that histogram as a “fingerprint”

Visualization of steps in astrophotography fingerprinting

The basic insight is that it’s not the absolute location of the points-of-interest but their internal relationships that you can rely on. You can rotate the source image by any amount and the internal angles of the mesh stay constant. And because you’re binning, this algorithm is at least somewhat robust against noise (either false POIs or, more likely, missing faint POIs).
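A rough sketch of steps 2 through 4 in plain Python, mostly to show the rotation invariance. Two simplifications to note: it uses every triangle that can be formed from the points rather than a proper mesh, and the point coordinates and the bin count are invented for the example. Step 1 (finding the POIs in an image) is skipped entirely.

```python
import math
from itertools import combinations

def interior_angles(a, b, c):
    """Return the three interior angles (in degrees) of triangle abc."""
    def angle_at(p, q, r):
        # Angle at vertex p between the rays p->q and p->r
        v1 = (q[0] - p[0], q[1] - p[1])
        v2 = (r[0] - p[0], r[1] - p[1])
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        cross = v1[0] * v2[1] - v1[1] * v2[0]
        return math.degrees(math.atan2(abs(cross), dot))
    return angle_at(a, b, c), angle_at(b, c, a), angle_at(c, a, b)

def fingerprint(points, bins=18):
    """Histogram of interior angles over every triangle of the points."""
    width = 180.0 / bins
    hist = [0] * bins
    for a, b, c in combinations(points, 3):
        for angle in interior_angles(a, b, c):
            hist[min(int(angle // width), bins - 1)] += 1
    return hist

def rotate(points, degrees):
    """Rotate every point about the origin."""
    t = math.radians(degrees)
    return [(x * math.cos(t) - y * math.sin(t),
             x * math.sin(t) + y * math.cos(t)) for x, y in points]

# Made-up points of interest
POINTS = [(0, 0), (4, 1), (1, 3), (5, 5)]

original = fingerprint(POINTS)
rotated = fingerprint(rotate(POINTS, 37))
print(original == rotated)
```

Rotating the points changes every coordinate but none of the interior angles, so the two histograms match. Dropping or adding a point changes only the triangles that use it, which is where the tolerance to noise comes from.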

This is a great algorithm that is amazingly good with astrophotography. But the geometry of the night sky is constant: our constellations appear very much as they did to people thousands of years ago. Whether taken last night or last century, a photograph of Orion is going to show Betelgeuse in one corner, Rigel in another, and a belt between them.

Two photos of the same whaleshark, though, will almost certainly be from different angles and distances. Another challenge is that the dappling of sunlight and shadowing from surface waves causes a lot of signal noise. So, today, whaleshark researchers have to do a lot of manual processing to identify an animal from a photograph.

My thought is to apply modern data-science and machine-learning approaches to identifying individual whalesharks. The main goal is really to create an efficient pipeline and not so much to create a better identification algorithm. (To be honest, I’ve already tried several “simple things that could possibly work” ML techniques and not gotten any traction, but I’m not giving up.)

Notes on installing TensorFlow with GPU Support

The best Tensorflow is the one you have on your machine.

In my opinion, the bottleneck on a DNN solution is not training, but data preparation and iterating your model to the point where it’s reasonable to start investing kilowatt-hours of electricity in training. So I have TensorFlow on all my machines, including my Macs, even though as of TensorFlow 1.2 GPU support is simply not available for TensorFlow on the Mac. (I’m not sure what’s going on, but suspect it may have something to do with licensing NVidia’s CuDNN library.)

Having said that, GPU-enabled TensorFlow is much faster than CPU-only TensorFlow (in some quick tests on my Windows laptops, ~8x). With GPU-supported TensorFlow, it’s that much easier to iterate your model until your training and validation curves start to look encouraging. At that point, in my opinion it makes sense to move your training to the cloud. There’s a little more friction in terms of moving data and starting and stopping runs, and you’re paying for processing, but hopefully you’ve gotten to the point where training time is the bottleneck.


Mac Tensorflow GPU: I’d like to think this will change in the future, but as of August 2017: Nope.

A very few people seem to have figured out how to build TensorFlow with GPU support on the Mac from sources, but the amount of hoop-jumping and yak shaving that seems necessary looks very high to me.

Windows Tensorflow GPU: Yes, but it’s a little finicky. Here are some install notes:

– Install NVidia Cuda 8 (not the Cuda 9 RC)
– Install NVidia CuDNN 5.1 (not the CuDNN 7!)
– Copy the CuDNN .dll to your Cuda /bin directory (probably /Program Files/NVidia GPU Computing Toolkit/Cuda/v8.0/bin/)
– Create an Anaconda environment from an administrative shell. Important: use python=3.5
– Install tensorflow using:

pip install --ignore-installed --upgrade

I think the “cp35” is the hint that you have to use Python 3.5, so if the install page changes to show a different .whl file, you’d have to set the Python version in your Anaconda environment differently.
– Validate that you’ve got GPU capability:

import tensorflow as tf
sess = tf.Session()  # creating a session is what triggers device discovery in TensorFlow 1.x

This should result in a cascade of messages, many of which say that Tensorflow wasn’t compiled with various CPU instructions, but most importantly, towards the end you should see a message that begins:

Creating Tensorflow device (/gpu:0)

which indicates that, sure enough, Tensorflow is going to run on your GPU.


Hope this helps!


Keras is the Deep Learning Toolkit You Have Been Waiting For

I remember a time when “learning about computers” invariably started with the phrase “computers only operate on 0s and 1s…” Things could vary a little for a few minutes, but then you’d get to the meat of things: Boolean logic. “All computer programs are formed from these ‘logic gates’…”

I remember a poster that illustrated Boolean logic in terms of punching. A circuit consisted of a bunch of mechanical fists, an “AND” gate propagated the punch when both its inputs were punched, an “OR” required only one input punch, etc. At the bottom were some complex circuits and the ominous question: “Are you going to be punched?” Because Boston. (The answer was “Yes. You are going to be punched.”)

Anyway, the point is that while there was a fundamental truth to what I was being told, it was not overwhelmingly relevant to the opportunities that were blossoming, back then at the dawn of the personal computer revolution. Yes, it’s important to eventually understand gates and circuits and transistors and yes, there’s a truth that “this is all computers do,” but that understanding was not immediately necessary to get cool results, such as endlessly printing “Help, I am caught in a program loop!” or playing Nim or Hammurabi. Those things required simply typing in a page or two of BASIC code.

Transcription being what it is, you’d make mistakes and curiosity being what it is, you’d mess around to see what you could alter to customize the game, and then your ambition would slowly grow and only then would you start to benefit from understanding the foundations on which you were building.

Which brings us to deep learning.

You have undoubtedly noticed the rising tide of AI-related news involving “deep neural nets.” Speech synthesis, Deep Dream’s hallucinogenic dog-slugs, and perhaps most impressively AlphaGo’s success against the 9-dan Lee Sedol. Unlike robotics and autonomous vehicles and the like, this is purely software-based: this is our territory.

But “learning about deep learning” invariably starts with phrases such as “regression,” “linearly inseparable,” and “gradient descent.” It gets math-y pretty quickly.

Now, just as “it’s all just 0s and 1s” is true but not immediately necessary, “it’s all just weights and transfer functions” is something for which _eventually_ you will want to have an intuition. But the breakthroughs in recent years have not come about so much because of advances at this foundational level, but rather from a dramatic increase in sophistication about how neural networks are “shaped.”

Not long ago, the most common structure for an artificial neural network was an input layer with a number of neural “nodes” equal to the number of inputs, an output layer with a node per output value, and a single intermediate layer. The “deep” in “deep learning” is nothing more than networks that have more than a single intermediate layer!

Another major area of advancement is approaches that are more complex than “an input node equal to the number of inputs.” Recurrence, convolution, attention… all of these terms relate to this idea of the “shape” of the neural net and the manner in which inputs and intermediate terms are handled.

… snip descent into rabbit-hole …

The Keras library allows you to work at this higher level of abstraction, while running on top of either Theano or TensorFlow, lower-level libraries that provide high-performance implementations of the math-y stuff. This is a Keras description of a neural network that can solve the XOR logic gate. (“You will get punched if one, but not both of the input faces gets punched.”)

[code lang="python"]
import numpy as np
from keras.models import Sequential
from keras.layers import Activation, Dense
from keras.optimizers import SGD

# XOR truth table: X[i] -> y[i]
X = np.zeros((4, 2), dtype='uint8')
y = np.zeros(4, dtype='uint8')

X[0] = [0, 0]
y[0] = 0
X[1] = [0, 1]
y[1] = 1
X[2] = [1, 0]
y[2] = 1
X[3] = [1, 1]
y[3] = 0

# 2 (inputs) : 2 (middle) : 1 (output), with sigmoid activation
model = Sequential()
model.add(Dense(2, input_dim=2))
model.add(Activation('sigmoid'))
model.add(Dense(1))
model.add(Activation('sigmoid'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)

history = model.fit(X, y, epochs=10000, batch_size=4, verbose=0)

print(model.predict(X))
[/code]

I’m not claiming that this should be crystal clear to a newcomer, but I do contend that it’s pretty dang approachable. If you wanted to produce a different logic gate, you could certainly figure out what lines to change. If someone told you “The ReLU activation function is used more often than sigmoid nowadays,” your most likely ‘let me see if this works’ would, in fact, work (as long as you guessed you should stick with lowercase).

For historical reasons, solving XOR is pretty much the “Hello, World!” of neural nets. It can be done with relatively little code in any neural network library and can be done in a few dozen lines of mainstream programming languages (my first published article was a neural network in about 100 lines of C++. That was… a long time ago…).

But Keras is not at all restricted to toy problems. Not at all. Check this out. Or this. Keras provides the appropriate abstraction level for everything from introductory to research-level explorations.

Now, is it necessary for workaday developers to become familiar with deep learning? I think the honest answer to that is “not yet.” There’s still a very large gap between “what neural nets do well” and “what use-cases the average developer is being asked to address.”

But I think that may change in a surprisingly short amount of time. In broad terms, what artificial neural nets do is recognize patterns in noisy signals. If you have a super-clean signal, traditional programming with those binary gates works. More importantly, lots of problems don’t seem easily cast into “recognizing a pattern in a signal.” But part of what’s happening in the field of deep learning is very rapid development of techniques and patterns for re-casting problems in just this way. So-called “sequence-to-sequence” problems such as language translation are beginning to rapidly fall to the surprisingly effective techniques of deep learning.

… snip descent into rabbit-hole …

Lots of problems and sub-problems can be described in terms of “sequence-to-sequence.” The synergy between memory, attention, and sequence-to-sequence (all areas of rapid advancement) is tipping-point stuff. This is the stuff of which symbolic processing is made. When that happens, we’re talking about real “artificial intelligence.” Artificial intelligence, yes, but not, I think, human-level cognition. I strongly suspect that human-level, general-purpose AI will have a trajectory similar to medicine based on genetics: too complex and messy and tangled to be cracked with a single breakthrough.

The Half-Baked Neural Net APIs of iOS 10

iOS 10 contains 2 sets of APIs relating to Artificial Neural Nets and Deep Learning, aka The New New Thing. Unfortunately, both APIs are bizarrely incomplete: they allow you to specify the topology of the neural net, but have no facility for training.

I say this is “bizarre” for two reasons:

  • Topology and the results of training are inextricably linked; and
  • Topology is static

The training of a neural net is, ultimately, just setting the weighting factors for the elements in the network topology: for every connection in the network, you have some weighting factor. A network topology without weights is useless. A training process results in weights for that specific topology.

Topologies are static: neural nets do not modify their topologies at runtime. (Topologies are not generally modified even during training: instead, generally the experimenter uses their intuition to create a topology that they then train.) The topology of a neural net ought to be declarative and probably ought to be loaded from a configuration file, along with the weights that result from training.
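To make “declarative” concrete, here is a small sketch in plain Python (not the iOS APIs). The JSON layout, the `build_network` helper, and the hand-picked weights are all hypothetical; the weights were chosen by hand to compute XOR rather than produced by training.

```python
import json
import math

# Hypothetical config format: topology and weights travel together.
CONFIG = json.loads("""
{
  "layers": [
    {"activation": "step", "weights": [[1, 1], [1, 1]], "biases": [-0.5, -1.5]},
    {"activation": "step", "weights": [[1, -1]], "biases": [-0.5]}
  ]
}
""")

ACTIVATIONS = {
    "step": lambda v: 1.0 if v > 0 else 0.0,
    "sigmoid": lambda v: 1.0 / (1.0 + math.exp(-v)),
}

def build_network(config):
    """Turn a declarative description into a callable forward pass."""
    layers = [(ACTIVATIONS[layer["activation"]], layer["weights"], layer["biases"])
              for layer in config["layers"]]

    def predict(inputs):
        values = inputs
        for activate, weights, biases in layers:
            # Each row of weights feeds one node in the layer
            values = [activate(sum(w * v for w, v in zip(row, values)) + b)
                      for row, b in zip(weights, biases)]
        return values

    return predict

xor = build_network(CONFIG)
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor([a, b])[0])
```

With this arrangement, swapping in a different topology or retrained weights means shipping a new resource file, not a new binary.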

When I first saw the iOS 10 APIs, I thought it was possible that Apple was going to reveal a high-level tool for defining and training ANNs: something like Quartz Composer, but for Neural Networks. Or, perhaps, some kind of iCloud-based service for doing the training. But instead, at the sessions at WWDC they said that the model was to develop and train your networks in something like Theano and then use the APIs.

This is how it works:

  • Do all of your development using some set of tools not from Apple, but make sure that your results are restricted to the runtime capabilities of the Apple neural APIs.
  • When you’re done, you’ll have two things: a network graph and weights for each connection in that graph
  • In your code, use the Apple neural APIs to recreate the network graph.
  • Load the weights as a resource (download or load from file).
  • Back in your code, stitch together the weights and the graph. One mistake and you’re toast. If you discover a new, more efficient, topology, you’ll have to change your binary.

This is my prediction: Anyone who uses these APIs is going to instantly write a higher-level API that combines the definition of the topology with the setting of the weights. I mean: Duh.

Now, to be fair, you could implement your own training algorithm on the device and modify the weights of a pre-existing neural network based on device-specific results. Which makes sense if you’re Apple and want to do as much of the Siri / Image recognition / Voice recognition heavy lifting on the device as possible but allow for a certain amount of runtime flexibility. That is, you do the vast majority of the training during development, download the very complex topology and weight resources, but allow the device to modify the weights by a few percent. But even in that case, either your topology stays static or you build it based on a declarative configuration file, which means that whichever route you choose, you’re still talking about a half-baked API.


Is Watson Elementary?: Pt. 2

I used to be on top of Artificial Intelligence: I wrote a column for and ultimately went on to be the Editor-in-Chief of AI Expert, the leading trade magazine in the AI field at the time. I’ve tried to stay, not professionally competent, but familiar with the field. That has been rather difficult because the AI field has largely put aside grand theories and adopted two pragmatic themes: statistical techniques and mixed-approaches.

Statistical techniques rely on large bodies of data that allow you to guess, for instance, that “push comes to” -> “shove” not from any understanding of metaphor or causation but because the word “push” followed by “comes” followed by “to” is followed 87.3% of the time by the word “shove”. Statistics excel at extracting patterns from large input sets.
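A toy version of that kind of guess, as a sketch: count which word follows each three-word context and report the most frequent one. The corpus string and function names are made up, and a real system would train on vastly more text.

```python
from collections import Counter, defaultdict

def train_ngrams(text, n=3):
    """Map each n-word context to a count of the words that follow it."""
    words = text.lower().split()
    table = defaultdict(Counter)
    for i in range(len(words) - n):
        context = tuple(words[i:i + n])
        table[context][words[i + n]] += 1
    return table

def predict_next(table, context):
    """Return the most frequent follower and its observed frequency."""
    counts = table.get(tuple(context))
    if not counts:
        return None, 0.0
    word, count = counts.most_common(1)[0]
    return word, count / sum(counts.values())

# Made-up corpus: "push comes to" is followed by "shove" twice, "nothing" once
CORPUS = ("when push comes to shove we act . push comes to shove again . "
          "if push comes to nothing we wait")

table = train_ngrams(CORPUS)
word, frequency = predict_next(table, ["push", "comes", "to"])
print(word, round(frequency, 3))
```

Nothing in the table knows what a push or a shove is; the answer falls out of the counts alone.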

Mixed approaches are ones which use different strategies to try to tackle different aspects or stages of a problem. Imagine a blackboard around which people raise their hands, come forward, add or erase a small bit of information, and step back into the crowd. For instance, one (relatively) simple tool might know that “X comes to Y” implies temporal ordering. Another might say that temporal ordering implies escalation. And another might say “A ‘Shove’ is an escalation of a ‘Push’”.
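The blackboard idea can be sketched in a few lines. The three tools below mirror the three examples in the paragraph above; their names, the dictionary used as the blackboard, and the one-entry `ESCALATIONS` table are all invented for illustration.

```python
ESCALATIONS = {"push": "shove"}  # hypothetical fact base

def ordering_tool(board):
    """Knows that "X comes to Y" implies temporal ordering."""
    if "ordering" not in board and " comes to " in board["phrase"]:
        board["ordering"] = "temporal"
        board["subject"] = board["phrase"].split(" comes to ")[0]
        return True
    return False

def escalation_tool(board):
    """Knows that temporal ordering implies escalation."""
    if board.get("ordering") == "temporal" and "relation" not in board:
        board["relation"] = "escalation"
        return True
    return False

def lexicon_tool(board):
    """Knows which word is an escalation of which."""
    if (board.get("relation") == "escalation" and "answer" not in board
            and board.get("subject") in ESCALATIONS):
        board["answer"] = ESCALATIONS[board["subject"]]
        return True
    return False

def solve(phrase, tools):
    """Let the tools take turns until none has anything to add."""
    board = {"phrase": phrase}
    while any(tool(board) for tool in tools):
        pass
    return board

# The tools are listed "backwards" on purpose: order should not matter
board = solve("push comes to ___", [lexicon_tool, escalation_tool, ordering_tool])
print(board["answer"])
```

Each tool only looks at what is already on the board and contributes one small fact, which is the whole appeal: no tool needs to know the others exist.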

The more I read about Watson, the more it seems that while Watson used mixed approaches, what it’s mixing are almost all statistical techniques. So while it would undoubtedly be able to answer that “shove” is what “push often comes to…” I think it would do so without any reasoning, or schema, about temporal ordering or escalation.

The problem with statistical techniques is they are not general.

If a child is shown how to win tic-tac-toe by always starting with an ‘X’ in the upper-left box, and then we ask them if they could always win by starting in another corner, we would be disappointed if they couldn’t figure it out. Maybe not at first, but if tic-tac-toe was something they enjoyed, they would eventually recognize the pattern. If they never achieved the recognition, it would be troubling.

Pattern recognition, not pattern extraction, seems to be “how” we work. If pattern extraction were at the core, we wouldn’t be troubled by sharks when entering the ocean and we wouldn’t spend money on lottery tickets.

So it seems that Watson uses a fundamentally different “how” in its achievement. Yet the capability of rapidly and accurately answering questions (ones that have been intentionally obfuscated!) is clearly epochal. Clearly Watson has a role in medicine (diagnostics), law and regulatory compliance (is there precedent? is this a restricted behavior?), and intelligence (where’s the next revolution likely?). The problems of “Big Data” are very much in the mind of the software development community and Watson is a stunning leap forward in combining big data, processing power, and specialized algorithms.


ResolverOne: Best Spreadsheet Wins $17K

ResolverOne is one of my favorite applications in the past few years. It’s a spreadsheet powered by IronPython. Spreadsheets are among the most powerful intellectual tools ever developed: if you can solve your problem with a spreadsheet, a spreadsheet is probably the fastest way to solve it. Yet there are certain things that spreadsheets don’t do well: recursion, branching, etc.

Python is a clean, modern programming language with a large and still-growing community. It’s a language which works well for writing 10 lines of code or 1,000 lines of code. (ResolverOne itself is more than 100K of Python, so I guess it works at that level, too!)

From now (Dec 2008) to May 2009, Resolver Systems is giving away $2K per month to the best spreadsheet built in ResolverOne. The best spreadsheet received during the competition gets the grand prize of an additional $15K.

Personally, it seems to me that the great advantage of the spreadsheet paradigm is a very screen-dense way of visualizing a large amount of data and very easy access to input parameters. Meanwhile, Python can be used to create arbitrarily-complex core algorithms. The combination seems ideal for tinkering in areas such as machine learning and simulation.

I try to do some recreational programming every year between Christmas and New Year. I’m not sure I’ll have the time this year, but if I do, I may well use ResolverOne and Python to do something.

IronPython 2.0 & Microsoft Research Infer.NET 2.2

import sys
import clr
sys.path.append("c:\\program files\\Microsoft Research\\Infer.NET 2.2\\bin\\debug")
clr.AddReferenceToFile("Infer.Compiler.dll")
clr.AddReferenceToFile("Infer.Runtime.dll")
from MicrosoftResearch.Infer import *
from MicrosoftResearch.Infer.Models import *
from MicrosoftResearch.Infer.Distributions import *

firstCoin = Variable[bool].Bernoulli(0.5)
secondCoin = Variable[bool].Bernoulli(0.5)
bothHeads = firstCoin & secondCoin
ie = InferenceEngine()
print ie.Infer(bothHeads)

c:\Users\Larry O'Brien\Documents\Infer.NET 2.2>ipy
Compiling model...done.
Initialising...done.
Iterating:
.........|.........|.........|.........|.........| 50
Bernoulli(0.25)
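As a sanity check on that Bernoulli(0.25), the same two-coin question can be answered by brute-force enumeration in ordinary Python. This is not how Infer.NET arrives at its answer; it is just a sketch that sums the probability of every outcome in which both coins are heads, and the function name is my own.

```python
from itertools import product

def probability_both_heads(p_first=0.5, p_second=0.5):
    """Sum the probability of every outcome where both coins are heads."""
    total = 0.0
    for first, second in product([True, False], repeat=2):
        weight = ((p_first if first else 1 - p_first)
                  * (p_second if second else 1 - p_second))
        if first and second:
            total += weight
    return total

print(probability_both_heads())
```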