The software at this site was written with the following engineering design principles in mind:
- Simplicity: as simple as possible but no simpler. (That is
a paraphrase of a statement attributed to Einstein about physics.)
The most simple methods are used that do not significantly
sacrifice performance.
- If it would take 1000 lines of code to double the performance of a 20 line algorithm, forget it; we'll use the 20 line method.
- If the simplest efficient method is a bit tricky, as with a general implementation of rasterops, so be it.
- Efficiency. This is the yin to the yang of simplicity. The goal is to have the algorithm be both simple and efficient. Data is handled 32 bits at a time whenever possible. But there is often room for improving efficiency if the need exists. For example, I generate and use 8-bit lookup tables throughout, but given the large size of caches in current systems, 16-bit tables are more efficient.
- Clarity. I attempt always to make the method clear. The goal is to allow a C programmer to read the source, and any associated information, and understand exactly what is happening. Comments are placed in the code so that the method can be quickly understood. The naming convention in the higher-level code is object-"oriented". The function name starts with the primary object being operated on or being returned.
- Portability.
I attempt to make the software as portable as possible.
This has several aspects:
- Written in C. (Also for efficiency and simplicity.)
- A low-level implementation, using only intrinsic C data types, that is (almost) independent of the application-level image data structures, is (almost) always split out from the higher-level code.
- There are two requirements on the image data array
of the application-level image data structure:
- The array must be in a single consecutive block.
- Each raster line must start on a 32-bit word boundary.
- A simple image data structure is used to (informally) verify correctness and show how to use the lower-level functions. In some cases, such as rotation by shear, the details of the algorithm are mostly in the higher-level code.
- Endian independent.
- Typedefs for intrinsic data types that can easily be changed and are written to avoid conflicts with other libraries.
- Use of macros for common operations, e.g. allocation of heap memory, where programmers often want to make substitutions determined at compile time.
- Code safety. This may seem like an oxymoron when
applied to C, but there are several things that do help
in the higher-level code. The emphasis is on
safe memory allocation and deallocation, avoiding
references through null pointers, and (optionally) giving
useful debugging information when something goes wrong.
A little work to prevent the library from crashing
applications more than makes up for all the time users
of the library would have spent mucking around in gdb.
You may find the code heavy on checks of variables
and function returns, but the alternative is worse.
- Do all memory allocation in higher-level code.
- Check all pointers passed to higher-level code functions.
- Check return values; in particular, for pointers to allocated data.
- Usually allocate arrays and data structures on the heap, either returning them or destroying them before the function returns.
- Use accessors for fields in high-level data structures, that check the pointer value before the access.
- Provide create() and destroy() functions for all data structures.
- Avoid globals, including statics, except for constants.
- Require each calling function to clean up all arrays that it is responsible for allocating and that are not put in a data structure.
- Use a mechanism, selectable at compile time, that gives a trace of the call stack when a test for memory allocation or access fails.
- Make evident which function is responsible for destroying each data structure. This is most easily arranged by requiring the function that calls create() for a data structure to call destroy() on it. Sometimes this is not convenient. For example, you may have an init() function that calls create() and a cleanup() function that calls destroy() on a number of data structures. Another example is when you create a structure and add it to an array of pointers to similar structs. If you add it by insertion (of the pointer), the protocol is that it is now owned by the array, and the function that destroys the array takes care of it. In the first example, some higher-level function is required to call both init() and cleanup(). In all cases, it is important to have a sensible, consistent and documented protocol.
- Never call exit() from a library.
Marin Saric recently pointed me to Robert Glass's "Facts and Fallacies of Software Engineering," Addison-Wesley, 2003. Of the many "Facts" in that book, the most interesting I found were on the life cycle of software. Considering the five stages (requirements, design, coding, error removal and maintenance), maintenance occupies about 60 percent of the total (Fact 41). Further, about 60 percent of maintenance is for enhancements, as opposed to bug fixes -- "software maintenance is largely about adding new capability to old software, not fixing it." (Fact 42). And then the kicker: "Better software engineering leads to more maintenance, not less." (Fact 45). Better software is better structured and easier to understand and modify. It encourages modification. So Glass can conclude that maintenance is a solution, not a problem (Fact 43).
There are many other gems in this book, describing, for example, the disconnect between engineers and managers, the problems with schedule-fixing and subsequent changes in requirements, the problems that arise when the designers don't do the coding, the absence of updated design docs or retrospective design descriptions that would help people maintain the software, the lack of a silver bullet for building reliable software, and the various methods that have been used for removing errors. (One of the best such methods is rigorous inspection -- Fact 37).
Glass says that quality is a collection of attributes. His seven "-abilities", not in any particular order, are portability, reliability, efficiency, usability, testability, understandability and modifiability. I found this interesting because they are the same requirements I have informally put on Leptonica. To mention three, I have put some effort into "understandability," both with comments within the code and with web pages describing the problem addressed and its applications, the algorithms used, and the resulting functionality, sometimes going down to the actual form of the functional interface. This documentation helps me (and others) modify the code. Because the Leptonica library is a programmer's interface, "usability" and "portability" go together to some extent; for example, the separation between high-level code that uses simple data structures and low-level code that uses only built-in C data types, and the use of a very simple data structure for generic images. Further, the definition and re-use of a small number of data structures simplifies the application of the Leptonica library to various problems. And the various constructs and patterns I use to catch errors and prevent program crashes contribute to both reliability and usability. These are discussed in more detail above.
Description 1:
A library is a set of abstractions that should allow you to
get the information you need without sweating all the low-level
details. For image processing, those "low-level details" are
often relations between pixel values.
Description 2:
A library is a set of functions that typically (but not always)
take a set of inputs and generate a set of outputs.
(If the function doesn't give you anything back, why would
you ever run it?)
Look at the description of the function.
It typically says what you need to provide and what you get
in return. It should also specify any constraints on the input.
Description 3:
A library is a set of contracts: "If you give me this,
I will give you that." That's the explicit part.
But there is more to the contract than what goes
in and what comes out. There is a very important implicit part
that governs:
- ownership: Who owns any stuff that is allocated?
- side effects: What are you allowed to do to the input? Are globals or static variables being modified?
- error handling: What happens when an error occurs?
- integration: What issues may come up?
With all of these, the contract should have a default condition throughout the library, and any deviations from the default need to be noted in the function description.
We've discussed ownership, side effects and some of the integration and portability issue in the previous section. For example, portability is enhanced by using POSIX library functions, being careful with function, typedef and global constant names to avoid namespace clashes, and using endian-independent implementations. Here, I want to say a bit about error handling.
What to do when an error occurs is a vexing problem because users will want the behavior to be anything from "crash immediately if an error occurs" to "just carry on, I expect errors because I have errorful or highly variable input and may not want (or be able) to test for abnormal conditions at each point." The degree to which the input and processing actions are constrained in the application is crucial. For image analysis applications, there is often sufficient uncertainty and variability in the image content that the programs will need to handle unanticipated situations. There is little I find more annoying than using a library where a function crashes the application by calling exit on error. Leptonica is guaranteed not to do that, ever.
Some languages like Java have error handling primitives (try, catch) built in. Because C is more primitive and low-level, error handling is done in an ad hoc manner of the library designer's choosing.
The user's needs can vary depending if you're in a development environment (e.g., "unwind the call stack with appropriate messages") or in production (e.g., "crash so I can figure out what went wrong" or "go on, but don't show me a lot of debug messages"). The library should be able to accommodate all these different ways of handling errors. It should also be reasonably good about cleaning up garbage when a function ends prematurely.
Another option in the user's implementation is whether or not to make an effort to inspect all returned objects and to explicitly say what to do in each case. For robust applications, it is desirable to check returned values to determine if an error has occurred. But this should be a programmer decision, and, in particular, the library functions should be written in such a way that an application should not crash if it continues after an error. It is obvious that the library function should not corrupt (or lose) memory, no matter what. But at the application level, how does the library prevent a crash? In general, of course, it can't with a language like C where there is access to pointers.
At the application level, most crashes are due to dereferencing an invalid pointer. Leptonica allocates nearly everything on the heap, so there are plenty of pointers being passed around. Leptonica data structures are initialized with calloc, so that all pointers in a struct are automatically set to NULL. Accessors are provided so that the programmer has no excuse to reference a field directly (and thereby possibly dereference a null pointer).
A tricky place is where a function returns a pointer to a struct that it allocated. The pointer itself is typically allocated on the stack, uninitialized, by the calling function and then its address passed in to the function that allocates the struct. An example should make this clear:
...
Pix *pix = NULL;
Pixa *pixa; // ptr allocated on stack but not initialized
// Heap allocated pixa should be returned, but in this case
// because pix == NULL, pixa doesn't get allocated.
// Suppose it is not initialized in the function.
Boxa *boxa = pixConnComp(pix, &pixa, 8);
...
Pix *pixt = pixaGetPix(pixa, 0, L_CLONE); // segfault
The function can fail to do the allocation if it returns early. In the code above, pixConnComp() returns immediately because pix is NULL. What about pixa? It is never allocated, but if it is not initialized before pix is tested, it will retain its random value. In that case, any use is almost certain to crash the program. Note that if the program were written to catch the error by testing the returned boxa, an uninitialized pixa should never cause trouble. If pixConnComp() does early initialization, the program won't crash under any circumstances.
The goal is thus to provide a programming environment with C library functions that is as close as possible to the runtime safety of an interpreter (like Matlab). Leptonica supports various options when an error occurs -- crash immediately, give an error message and return from the function, return silently -- by using error handling macros throughout. By default, these print the name of the calling function with an error message, and return with some value cast to the return type of the function.
Most EE/CS students get an introductory course that compares programming languages. Here, I present a somewhat biased summary of the extreme features of programming languages, with a bit of history thrown in. We start by noting that extremism is the rule, not the exception, in programming languages.
Whereas compiler design is an engineering discipline (in fact, an offshoot of discrete mathematics), the design of programming languages is an art form, notwithstanding nearly 50 years of intense experimentation. There is no manual or set of rules that tell you how to design a language. The languages that have been designed and used display a wonderful diversity, but most of the ones that are well-known, if not widely used, take some characteristic to extremes. This might seem puzzling: why are the best known languages typically pure in some aspect rather than amalgams of methods that have been shown to be generally useful?
Let's consider some examples. Xerox PARC developed three different programming environments in the 1970s. These were all optimized to run on the same bit-sliced 16-bit hardware, using microcode that defined the basic machine operations for each of the three systems. The three languages used were Lisp, Smalltalk and Mesa/Cedar. Each of the three systems was written in the programming language and supported only programs written in that language.
Lisp is an extreme language because it has the most simple syntax invented: everything is a list. Unless the list is "quoted," which means to take it literally, the first element is an operation and the remaining elements are data to be acted on. The parser for Lisp needs to identify only parenthesis and quote signs, and evaluate the expressions from the inside-out. Lisp is also extreme in that no data in Lisp needs to be explicitly typed. (It shares this characteristic with most scripting languages, where typing is done implicitly.)
Smalltalk is an extreme language because it has the most uniform set of datatypes invented: everything is an object. Even the integer 3 is an object. All objects have data and functions associated with them. (The Smalltalk inventors called procedures that act on objects "methods", thus overloading yet another common word.) The objects are arranged in a hierarchy that allows subclassing with inheritance of data and functions from above.
Unlike Lisp and Smalltalk, Mesa/Cedar was extreme in that it was strongly typed. You could defeat the typing, but it took more than a simple cast and was heavily discouraged, requiring something like signing your name in blood to the deprecated code. The strong typing was necessary because Mesa/Cedar shared one fundamental aspect with Lisp and Smalltalk: all functions ran in the same memory space as the operating system! This in itself was an extreme decision, because a poorly written piece of code in Mesa/Cedar could bring everything down. Lisp and Smalltalk were languages that were meant to be interpreted, and the designers absorbed protection against memory smashes into the inefficiencies that naturally come with interpreters. But Mesa/Cedar was meant to be an efficient language and programming environment, with optimizing compilers; no runtime cycles were to be wasted checking memory accesses. Hence the requirement for strong typing. Mesa also incorporated another safety feature of Lisp and Smalltalk: garbage collection. You couldn't afford to have programmers messing with dynamic memory allocation! And because of the strong typing, garbage collection was a relatively straightforward engineering task. (As an interesting comparison, the line of Microsoft operating systems including Windows 3.1, 95, 98, and 98SE, which were sold until 2001 when XP was introduced, had neither pre-emptive multitasking nor memory protection to encapsulate the addressing of applications. Consequently, any application could hang or crash the entire system. In view of the fact that, with Unix, people knew how to design an operating system over 30 years ago that was protected from bad behavior by user-level processes, and further, we have had on-chip hardware memory protection since the mid 1980s (e.g., in the Intel 80386 and Motorola 68020), this seems to be a very strange way to design system software.)
Most of the other languages have had extreme characteristics. C, the most successful language ever developed, is extreme in its minimal number of operations and built-in data types, and in its close relationship to machine operations. Its minimality is in sharp to contrast to Pascal, which incorporated into the language many of the several hundred functions that C includes through libraries. The oft-heard disparagement that C is an "assembler language" is an example of the extreme positions that programmers often take. In fact, C has no control of registers and supports manipulations on arbitrarily complex data types.
Forth is extreme in that it uses an explicit stack. Whereas many other languages have interpreters or compilers that use stacks internally, Forth requires the programmer to push and pop every piece of data to and from a stack, and, worse, to remember exactly what is on the stack at all times. There is an advantage for the designers of Forth in putting all this mental effort on the programmer: the interpreter is trivial; it just maintains the data on the stack. Another disadvantage of this simplicity is that source-level debuggers are (to my knowledge) unknown; if something goes wrong, the most you get is a description of the stack at that moment. PostScript is another example of a pure stack-oriented language, but, of course, PostScript was never designed as a programming language. The only things that the PostScript designers ever intended to be written in PostScript were print and display drivers for specific output devices.
The list goes on. Does anyone remember prolog? Prolog was extreme in that it was intended for "predicate logic" programming: each statement defined some logical relation between variables, and the interpreter treated these relations as rules from which it would try to develop other connections using boolean logic. This was an example of declarative programming, as opposed the usual method where the programmer tells the machine what to do. Logic programming was used for special applications such as theorem-proving and so-called "AI expert systems." Several AI proponents, notably Feigenbaum at Stanford, proclaimed in hysterical writings in the late 1980s that Japan was going to use prolog to build a "Fifth-Generation" computer that would be so "smart" that it would wipe out the high tech industry in the United States within five years. (Of course, he was trying to get more funding for his own AI projects.) After five years, with the Fifth-Generation project a complete failure, it appeared that prolog was merely a French plot to retard the advance of Japanese technology. Predicate logic programs have been used with great success in Mathematica (and its predecessors) to solve difficult problems in mathematics. This is programming by pattern-substitution: you put in rules, such as "if you see a pattern that looks like this, do that." With such methods, for example, you can use Mathematica to get a closed-form expression for any integral you can find in Gradshteyn and Ryzhik's monumental book Tables of integrals, series and products!
And then there is ML, which used to be popular in Scotland, at least. ML is an extreme programming language because there are no side effects. Lisp programming practice might tend in this direction, but in ML, it's the law. The way I visualize an ML program is that procedures get called by other procedures ad infinitum, but everything returned is just input to another function -- think of the data as always coming back onto the stack that holds input variables for the next function call. At the end, when the music stops, you find that nothing remains of all the activity. I've been told that one of the trickiest programming tasks in ML is to write a for-loop iterator, perhaps because it needs to keep some state around to test for exit. I hope someone will explain to me the allure of the extremism in the design of ML.
We're not finished yet. Scripting languages, of which Perl is the best known, are extreme in that all data is treated as strings, with arithmetic data being typed implicitly. Efficiency is sacrificed for ease of use. These languages do more than use pattern matching to manipulate strings. They have been extended to invoke programs, like a shell script on steroids. Perl had the lucky timing to be on the scene in 1995 for the WWW revolution; consequently, it has been the most common scripting language used for server-side CGI web applications.
Python is another scripting language that has no data typing. It is much more powerful than Perl, because it has objects, inheritance, the Tk X graphical display library, and a very large number of application modules to handle the interface to everything from databases to internet applications. Python is extreme in its typographic minimalism. The parser relies on typographic appearance: white space is an integral part of the syntax. Python enthusiasts paraphrase Sierra Madre, "We don't need no stinking braces," because scope and nesting is determined by indentation alone! And there are no semicolons, because statements do not carry over between lines. This is opposite to most languages, which ignore white space entirely (including newlines). Python can be extended with function calls to procedures compiled from other languages (e.g., C, C++). It is the most interesting scripting language, in terms of simplicity and capability, yet invented.
Even C++ is extreme in a particular way: while maintaining the constraint of including C as a subset, C++ adopted nearly every object-oriented programming feature known to man, including multiple inheritance. It is thus extreme in its "kitchen-sink" approach to a programming language. Additionally, C++ allows operator overloading. This is illustrated by the canonical "cute" example of complex arithmetic. However, it lends itself to very serious abuse because the apparent syntax of the language can be changed at will. And unfortunately it often is, making it very difficult to understand what is actually happening. This is similar to the competitive stunts people have done with the macro capability of the C preprocessor; however, with C++, programmers do this for real. Another piece of syntactic sugar is the automatic invocation of constructors, which can lead to extra copying and inefficiencies if the programmer is not very careful.
Java is even more recent, and is the first serious programming language that is, in my opinion, not extreme. (Well, it is extreme in that all procedures are "methods" of an object; you cannot write a function that is not within the file and scope of some object, even if that object would never be instantiated.) It has a carefully selected set of built-in features from C and C++, plus aspects that one finds in interpreted languages, such as a byte-code generator with a byte-code interpreter, automatic memory management, and threads. (Interestingly, byte-codes, automatic garbage collection and lightweight processes were all included in Xerox's Cedar.) Java thus is an attempt to make a language that takes the best features of C++, leaves the worst (operator overloading, multiple inheritance), and adds new safety features (garbage collection) and a byte code compiler for portability and safety in web applications. Like Python, it is distinguished by having a very large library of functions for doing application and systems programming. Microsoft considered it to be a sufficiently dangerous framework for non-Microsoft applications that they went to great lengths to prevent it from working on Windows. Thanks, Bill.
The psychological basis of extremism in programming is very simple: people crave simplicity and equate it to beauty and truth. In the early days, when the engineering principles of compilers were in a rudimentary stage, it was also good engineering to design a system with a simple parser. With today's compiler building tools, such as lex and yacc (or flex and bison), it is not difficult to build an interpreter/compiler for a language such as C, so there is no excuse for putting a lot of bookkeeping baggage on the back of the programmer. But compilers for a language as large as C++ have proven to be difficult to build, and some recently released versions of the GNU g++ compiler (e.g., 2.96) have been buggy.
This is all an example of Bloomberg's Law of Complexity, which paradoxically stands in firm opposition to the psychological desire for simplicity and elegance: the state of technology at any time is as complicated as it possibly can be and still mostly work. This explains the ever-increasing complexity of languages like C++ and operating systems like Windows. A corollary to the law of complexity is: if something can be added, it will.
Historical note on limits of extremism: in the late 1980s, when Xerox saw that its power-hungry homemade GaAs-based Dorados could not keep up with low-power commercial CMOS-based microprocessors, the Cedar runtime was ported to other hardware (e.g., SPARC), thus allowing Cedar to run as a single large process under Unix. The Cedar runtime supported the lightweight threads, that were invisible to the underlying OS. To maintain efficient operation, the port was done by compiling the Cedar/Mesa code to C, and then using the native C compiler. This was part of the unraveling of the short-lived age of monolithic computer environments. Years earlier, Peter Deutsch ported Smalltalk to the 68010-based Sun2 (by writing a virtual machine in C to emulate the virtual machine previously implemented directly on hardware by microcode), thus paving the way for the ParcPlace spinout. A similar approach was taken later to port the Xerox CommonLisp system to Suns, in a last-ditch effort to commercialize Xerox's Lisp programming environment.
I predict that the availability of valgrind will make significant improvements in the quality of open source code. For example, when you run valgrind with the png library, it finds memory access errors. And there are so many errors with the GNU C library that Seward provides a special suppression file containing errors that are not to be flagged!
Of course, it's better not to make the memory errors in the first place. Above, I've given some of the design principles on how memory can be managed in relatively safety. One basic idea is to replicate the data when needed, such as if you must embed some instance of a structure in more than one data structure. The alternative is to rely on access to the same instance from different places in the software, and you may force the application programmer to remember who owns what and in what order allocated data should be collected. This is generally bad practice, because it complicates the use of the library and invites serious error. With replication, each user of the data can have their own copy, for which they are responsible.
Replication, however, can be expensive, particularly when you are copying large structures like images. The simplest solution to this problem is reference counting. For image and box data structures, I allow multiple handles to the same data, with a reference count within the PIX and BOX so that we know how many handles are around at any time. To control the process, it is necessary to call a special function, pixClone() or boxClone() to get a new handle (you can't simply assign a new handle to the image!). It is also necessary to call a destruction function (e.g., pixDestroy()) on every handle. Only when the final handle to a PIX or BOX is destroyed does the reference count go to zero and the actual PIX or BOX structure can be freed. By following these simple rules, the programmer can write safe code without memory leaks, and without worrying about what handle "owns" what structure.
A common programming pattern that uses the pixClone() arises when you have an image processing function that specifies both a source and destination image, where if the two images are the same, the operation is intended to be in-place, but otherwise it is not. Most operations cannot easily be done in-place, so the in-place situation is implemented by first copying the source image to a temporary image, which then acts as the source so that the result can be written to the input source image. To simplify the use of both options in the implementation, if the operation is not in-place, you still make a temporary image, but this one is a clone of the source, so that it is really just another handle to the same data structure. Then the operation always proceeds using the temporary image as the source and writing the result into the destination. Finally, the temporary image is always destroyed. If the temporary image is a clone of the source, it will have a reference count of 2, and its "destruction" will simply lower that count back to 1.
I mentioned above that we use reference counting to allow two "clients" to hold on to the same data structure. The most important context in which this applies is where the "clients" are pointer arrays; here, the PIXA and BOXA. With reference counting, two PIXA arrays can hold pointers to the same PIX, and each can "own" it in the sense that when the pixaDestroy() function is called on the two PIXA, no PIX is freed twice.
Here is a simple but very important rule that applies to all ptr arrays that are not stacks or queues. It is often the case that you need to add structures to the end of an array, but don't need to replace them. You just accumulate them in the array, for use either individually or by functions that use the whole array. For such arrays, of which the PIXA and BOXA are examples, it is simplest to provide two functions: a procedure to add a structure to the end of the array and another to access a structure within an array without removing it. (Again, if you find you need to both add and remove structures from the end of an array, you should use a stack or a queue.) Here's the rule for accessing a structure without removing it:
It all comes down to ownership: the entity that is responsible for destroying the object. To set up the safest programming environment within the limitations of C, we do not allow access to a structure in an array (of PIX or BOX) without requiring a clone or copy to be made, even if you only want to determine the value of some internal field. This still incurs a slight overhead with a clone, because you must remember to destroy it.
Mathematicians stand on each other's shoulders,If you are going to write large programs, you must handle the complexity that comes from interactions between different parts, and, in particular, the pervasive influence of the "interfaces," both through functions and data structures. I hope you find the following at least entertaining.
While computer scientists stand on each other's toes,
And computer engineers dig each other's graves.
1. Fred Hoyle's N2 law of complexity
At age 16, I read Fred (later, Sir Fred) Hoyle's Frontiers of Astronomy, which reinforced my belief that science was the highest calling. Hoyle described the evolution of stars, and the nuclear reactions that were going on inside, in a way that demonstrated the power of observation and deductive logic.Hoyle was an unusual person. He made fundamental contributions to our understanding of nucleosynthesis in stars, and he worked on many different things. He speculated on the archaeological dating and purpose of Stonehenge. He gave lectures propounding a theory that flu viruses come from outer space, his evidence being that pandemics supposedly coincide with the 11 year solar activity cycle, and the flu spreads too quickly to be passed from one person to another! But this isn't the only instance where his speculations have been soundly rejected by the scientific community. Hoyle formulated a cosmological theory of the universe, called the Steady State universe, and defended it for 50 years until he died, in spite of overwhelming evidence to the contrary. Hoyle deprecatingly called Gamow's competing idea the "Big Bang" theory, a name that stuck! The Big Bang has since been confirmed by all experimental observations, and in particular, the cosmic microwave background radiation at 3 degrees K. Hoyle's Steady State cosmology was attractive in that it "solved" the problem of the origin of the universe by removing it: he claimed that there was no origin -- the universe has always been here, just as it is today. For the density to remain constant in an expanding universe, matter must be created out of the vacuum. Unfortunately for Hoyle, this has not been observed at the required rate (or at any rate, for that matter). I find it fascinating that Hoyle, along with his colleagues Gold, Narlikar and Bondi, all brilliant scientists, stuck to their theory with a stubbornness that defied logic. Why do these young people get an ideé fixé, that forms the central core of their scientific career? But I digress. Hoyle also enjoyed writing science fiction, and I eagerly read each book as it came out. A for Andromeda appeared in 1962. The plot has some young SETI (Search for Extra-Terrestrial Intelligence) scientists discover an unusual signal from outer space. When they finally crack the code -- I believe the "Rosetta Stone" for the code was the frequency of some spectroscopic absorption line in hydrogen -- they found that it was the description of a computer along with a mysterious computer program. They built the machine, ran the program, and -- voilà -- it made a human-like robot, Andromeda. Hoyle made one claim that impressed me: the complexity of a computer program goes as the square of its size. The program for Andromeda was huge, and its complexity was consequently unimaginable. Although I accepted Hoyle's claim, which was based on some loose reasoning about how the parts interact, it left me uneasy, because it stated that programming doesn't scale: large programs are not practical.
Although Hoyle was wrong, I ran into a situation that seemed to support his "law" soon after I went to Xerox. One of my colleagues gave me a plotting program written in Fortran that needed to be modified. This program was only about 500 lines long, but it was a maze of spaghetti code, with 'goto' branches swarming throughout a single function. In fact, the complexity may have been even higher than second power in size! Here, the complexity came from the strange internal states and twisted control flow, rather than from interfaces between different parts.
Although I've been sympathetic ever since to people like Wirth who wish to outlaw the 'goto' statement, I'm glad K&R included it in C because 'break' only takes you out one level.
2. Complexity at the interface
The center manager at Xerox PARC during the late 80s and 90s was a character. He shall remain nameless; I'll only refer to him by the initials of my sister Judith before she was married: jsb. I found it interesting to listen to jsb when he was in freestyle pontification mode. He would say strange things that were surely designed to rearrange your neural connections. Most of what jsb said was nonsense, but if you didn't take it seriously (and didn't tell him that you thought it was nonsense), you were likely to survive the experience.I decided to play an irreverent joke in 1993. At this time, each laboratory manager collected and forwarded monthly a set of research highlights up to the center manager, who would then send a subset on to the Xerox VP of research. I wrote a research highlight, with the winking approval of my laboratory manager, claiming that I had made a radical architectural breakthrough and had solved the interface problem in programming. My brilliant solution involved a large data structure I called a "KitchenSink." All functions took a KitchenSink and at most one other argument, and returned nothing of significance (except for the one function that actually made a KitchenSink, of course). In this bold stroke, I claimed to eliminate all of Perlis's problems with interfaces. All "side-effects" went into the KitchenSink, which itself was composed of an array of indeterminate size of pointers ('void*' pointers for C), any of which could point to an array or list of things. Among the things pointed to were, of course, KitchenSinks, that had been configured into some application-dependent formats, all eventually ending up as intrinsic types supported by the specific language. I mentioned that the recursive use of KitchenSinks was theoretically required to generate the complexity of a Turning Machine. This was complete nonsense, just smoke and mirrors to dress up the claim, but it fit with the AI paradigm of using recursion because of its elegance. This was also in the era when C++ and object-oriented programming were becoming fashionable, and I emphasized the maximal use of data hiding in the KitchenSink.
jsb did not forward my "breakthrough" to the corporate level. I believe that he was so astounded that an ex-physicist rather than an AI person might solve the biggest problem (he thought) in computer architecture, that he consulted my laboratory manager, who let him in on the joke.
I had another chance to spoof jsb on his interface beliefs. I prepared a humorous summary of conclusions from a mid-90s committee set up to evaluate technological opportunities in computing. There, I used the mathematics of high dimensional spaces, where the volume of a hypersphere is nearly all within a very small distance from the surface, to assert a proof that the interface is everything. The analogy and the proof were silly -- I gave it to amuse the physicists. But the geometry is significant, because it links the statistical definition of entropy (as proportional to the log of the number of accessible states) to the thermodynamic definition. Consider two identical systems, one completely isolated with known energy E and the second in contact with a heat bath at some temperature, freely exchanging energy, and consequently with an energy that averages to be E but has some uncertainty. All thermodynamic properties of the two systems are identical, so their entropies must be the same. But one has a fixed energy E so that the accessible states occupy a very thin hypersphere shell in momentum space, and the other occupies states in a thicker shell. The conclusion is that the thickness of the shell thus doesn't matter: the thinnest possible shell contains essentially all the states in 1023 dimensions!
The software engineers who built, maintained and extended the system were outstanding. They understood pattern classification and image analysis, and were very careful. Yet ultimately, they were overwhelmed by complexity. Perhaps some of them will eventually describe what they learned about complexity and how to avoid it.
Although silly, the KitchenSink joke demonstrates that architectural complexity is shared between function interfaces (verbs) and data structures (nouns). There must be some balance: you don't want to use a KitchenSink, but you need some data structures to bring order into the application. Without data structures, all the data that needs to be in scope must be passed in to a function or created there. For big applications, this would be a nightmare, with long strings of arguments of data passed between functions.
By contrast with this extreme, the object viewpoint is to have each function be responsible for a task, to divide tasks up into subtasks where appropriate, and to make a special data structure (one, if possible) for an application, such as JbClasser in the jbig2 classifier. This structure keeps all state and contains (most of) the data required for computation on it. You create the structure, you store computation results in it, and you use it to generate whatever output is required (such as images, or bounding boxes, or compressed file data, etc.). And finally, you destroy it with one command. Above all, at the application level you do not need to ask questions about what is inside. The object manages the details that the application programmer doesn't want to know about, and would be likely to cause problems if he were forced to manage them.
When using a language like C, it is especially important to have a set of classes (or structures) for common data sets, and for these to be implemented with accessors and other safety factors (such as cloning handles) so that the application programmer is unlikely to make a mistake that will either crash the program or leak memory.
By subdividing programs up into simple, logical parts, and using a library of (relatively) memory-safe functions on commonly-used data structures, large applications can be built where it is clear what is happening at each step. That, however, is not enough to manage complexity. Consider two types of programs:
Note that the process of going from a valid PostScript text file to a raster is well determined, but the inverse process (image analysis) is badly determined. It is in this second case, where the input is so weakly constrained, that the biggest problems are faced in managing complexity. Decisions have to be made based on characteristics of arbitrary binary data, rather than using well-formed strings to run a finite state machine that can be verified to behave correctly. This is the frontier, where engineering faces its biggest challenges and engineering principles have yet to be properly expressed.
Stepping into this frontier, I offer a few vague design principles for building and tuning image analysis programs. This little section is a work in progress, and I offer it with some humility. As mentioned above, the difficulty in analyzing images is due to the fact that, to a large extent, images are unstructured data. Unusual and unexpected things can and do appear. Consequently,