Five Times Faster Than Theoretically Possible
- For the same project as above, one of the other developers implemented the flood fill
function.
He was uncharacteristically proud of the work he had done, and stated he believed his
code was within a factor of 2 of the theoretical max performance achievable (for
the given hardware).
Unfortunately, his code was not meeting performance requirements.
It only took me a minute or two of looking at his code to determine that it was far
from optimal, and I took on the task of making a new implementation.
While a number of techniques not present in the original version were employed
(e.g. leaning on the graphics hardware for front-end and back-end processing and
exploiting spatial locality), the primary technique used (and one that was
immediately obvious to me but clearly not to the original developer) was to
implement the core fill code as a series of bitwise and shift operations such that
instead of individually processing each pixel, 32 were processed at a time.
My version ran 10x faster than his, at least 5x faster than his "theoretical max
performance", and met performance requirements.
- In my experience, the other developer referenced above is the norm.
Most developers, even among those considered to be eminently proficient programmers, do
not have the ability to really optimize for performance.
And this is not the only time I've created code that was not only significantly faster
than other developers' code, but multiple times faster. (I have yet to experience
the opposite case - where a colleague creates code that outperforms my own.)
Low Level Adventures
- Code of Death
- At Astronautics there was a project involving a custom 3rd-party single board computer
(SBC).
The project ran into an issue where the system would completely lock up, but only
rarely - typically only after multiple days of operation.
That made the problem extremely difficult to track down, and neither the SBC
manufacturer nor Astronautics made any progress on this issue for months.
Eventually I was temporarily pulled from the project I was working on and asked to take
on this problem.
After a few days I was able to pretty much rule out software and any sort of
deterministic hardware issue, and suspected it was likely either a power bypass or
crosstalk issue in the SBC.
I spent a couple of days using an oscilloscope to try to identify anything appearing
marginal on the SBC before deciding to create some code that would both stress power
bypassing and create a significant amount of electrical noise by slamming
alternating all-1s and all-0s on the memory bus at maximum speed (using MMX
instructions to get full 64 bit [i.e. memory bus width] operations on the otherwise
32 bit CPU, and bypassing the data cache so the data would actually hit the
CPU-external bus).
The resulting code was able to make the SBC hardware fail in about 3 seconds.
To verify my code was not crashing all by itself, I ran it on multiple standard IBM PC
compatibles, most of which worked fine, but one of which my code was able to take
down after a few hours.
Boeing, our customer for this product, referred to my code as "The Steidl Code of
Death".
With the help of this reproduction code, the makers of the SBC were finally able to
locate and fix the problem with their board, moving a capacitor to fix a power
bypass issue.
- While Astronautics does not provide letters of reference, I think noting who they
turned to when they ran into a difficult problem like this is informative.
- "
640K ought to be enough for anybody."
- I was tasked with creating a debug program for an embedded Pentium MMX system: Provide
a user interface via serial port, have various functions which allow reading/writing
I/O ports and memory, and implement assorted other features like performing a memory
initialization sequence.
That would be a non-notable project, except there's a catch: Two of the hardware issues
they wanted to debug with this program were why the SRAM wasn't working, and why the
SDRAM wasn't working, so the program had to function without any working RAM.
(The hardware did have working flash memory, so at least an executable image could be
loaded and executed.)
No RAM means no RAM-based stack which means using a C compiler for this project would
have been impractical.
Even with assembly (32 bit 80x86), practical code (quick/easy to implement and modify)
involves using some sort of stack or stack-substitute for saving/restoring local
variables and return addresses.
Since only a fairly small amount of stack space was required, my solution was to use
the MMX registers for this purpose.
- I would guess that a lot of developers, if presented with this same task, would
simply respond "No, that's not possible."
Experience
Independent Works in Progress (2014 - present)
- Computer Role Playing Game
- This is currently in the design phase with 9+MB of docs spread over 300+ text files.
- Music Composition
- 2500+ ditties and counting, some of which are of suitable style for the RPG
- Computing Reimagined
- This started out as a language+compiler project intended for use with the RPG, but has
become my primary focus, and its scope has expanded to essentially all aspects of
computing.
The project has yielded significant fruit, e.g. determining how to:
- perform both hardware and software optimizations that existing computer systems are
unable to do
- make CPU caches more efficient
- make improvements that involve "code breaking incompatibilities" (e.g. changing
syntax and semantics in a language) without actually breaking compatibility
- make code properly scale to a given platform (e.g. from 8 bit microcontrollers to
64 bit server boxes)
and has also led to fundamental transformations to the nature of source code, program
data, storage systems, drivers, editing, and computer graphics and audio.
Currently a limited version of the system directly generates MS-Windows x86-64
binaries, limited in part because many features can't be implemented on existing
operating systems.
This will be split up into two stages, with a platform neutral distribution format
being both the interface between the two and accepted by the below project.
- Personal Computer System
- The goal of this project is to design and implement a new personal computer system which:
- is an instance of "computing reimagined" and provides a sound platform to flesh out
the ideas involved
- just about anyone (with minimal soldering skills) can successfully assemble from
individual parts and understand its operation
- is relatively inexpensive while providing good capabilities (within the constraints
imposed by the previous bullet)
- addresses many of the failings of current personal computers (e.g. security,
resource management, complexity, transparency, respect and empowerment of the
computer's owner)
The custom CPU, GPU, APU (audio), and system controller are all off-the-shelf hardware
with custom firmware.
E.g. the CPU instruction set is fully custom, as specified by the firmware (which
translates the custom instructions to micro ops implemented by the hardware, caching
micro ops such that cached code operates at native hardware speeds - translation is
fast enough that it's on par with main memory access costs).
I have created a firmware-level emulator and disassembler for the CPU, GPU and system
controller hardware, and started on an assembler and "emulator vs real hardware"
validation tests.
The CPU instruction set has been worked out, as well as how key firmware challenges in
the CPU and GPU are to be addressed.
Schematic entry for the CPU, GPU and system controller sections is pretty much
complete.
Design of the power supply section is in progress.
Oracle Corporation (2004 - 2014), Principal Member of Technical Staff
- Took on the challenge of reimplementing Oracle's file-based JMS (Java Message Service)
provider, OC4J JMS, to improve its performance:
- Rewrote the file storage code: Made it a standalone general-purpose persistent object
store for problem decomposition, reusability, and independent testability. Used a
log-structured scheme, batching, and read reordering for performance. Created
extensive tests, including the simulation of various failure conditions (e.g. power
loss) to verify data integrity was maintained.
- Created a service which accepts work to be performed, assigns it to available worker
threads, and manages the addition and removal of threads. Its key feature is
performance, being 100% lock-free code, and using batching for reduced memory
contention.
- Rewrote the JMS logic to allow for the parallel execution of any number of JMS connections and sessions.
The rewrite resulted in performance many times that of the previous OC4J JMS implementation (multiple orders of magnitude in some high concurrency scenarios), and also superior to every competing JMS provider tested. The high coding efficiency meant that even while pushing more messages/sec than other JMS providers, the new OC4J JMS usually consumed less CPU (often ~75% less than some other JMS providers).
- Maintained and enhanced Oracle's J2EE Resource Adapter for plugging JMS providers (both
Oracle and 3rd party) into Oracle's application server.
A few such tasks include:
- Conceived and implemented the automatic generation of the required Java bytecode at
runtime to automatically expose new JMS provider functionality without the customer
needing to wait for Oracle to release an updated version of the JMS Connector
(patent US9456017B2).
- Conceived and implemented a tracing feature and trace analysis tool in order to quickly
perform failure analysis in a complex, stateful environment involving many players
(JMS Connector, various JMS providers, J2EE transaction manager, customer code),
without the need for a working reproduction case.
- Rewrote the most complex part of the JMS Connector (MDB endpoint handling) to address a
number of correctness issues.
- Wrote bare-metal proof-of-concept boot, debug, and networking code for an x86-64 based
computer, and tested its ability to process and respond to network requests. This
successfully demonstrated latency several times smaller than what Java achieves on the
exact same hardware.
Walking Robot (2005)
For this hobby project I designed and assembled the mechanics, circuit board, and software for an
8 legged walking robot.
It was entered into and took first place at the walking robot contest at PDXBot 2005.
(This was a timed contest, and my robot was multiple times faster than any of the other entrants.)
Related:
project description,
code description
Astronautics Corp. of America (1998 - 2004)
- Development (1999-2004)
- bootstrap (handles initialization of Pentium processor, Intel north-bridge and other
hardware; CRC checks and power-up tests)
- RS-422 and MIL-STD-1553 program/data loaders
- graphics routines (filled polygons, cubic curve fitting, flood fill, text, video, etc.)
for the 3D Labs G2/R4 chipset
- generic ARINC 429/702 base classes and many device-specific protocol handlers which are
derived from them
- a tool to parse .hpp files and dependency #pragma's, resolve function order
dependencies into a specific calling order, and auto-generate the top-level .cpp
file
- application code for a number of aircraft
- formal product documentation, user documentation and extensive module-level
documentation (interface and implementation)
- test and simulation code to aid development
- a number of other activities, ranging from programming an FPGA to catch COTS hardware
violating the PCI spec to debugging Linux kernel code
- Verification (1998-1999)
- Verification work for various aircraft instruments, mostly electronic flight
instruments for the Boeing VC-25 aircraft (Air Force 1), which involves: Reviewing
software requirements documents, writing test scripts, evaluating the results of those
tests, designing/performing hardware tests, and analyzing the C++ source along with
execution information generated by the McCabe tool-set to improve test coverage,
classify code as per FAA guidelines, and identify coding errors and other unspecified
behavior.
Tremplo Software Inc. / Stetzer Accounting (1992 - 1997)
- Money Tracker
- Created a C++, MS-Windows program based loosely on a previous product, Super
Spreadsheet (which was written in BASIC).
It uses the C++ library I created (see below).
The library's capabilities and efficiency allow the entire program to be about 8300
lines of code (of which I wrote about 4700 - the rest was written by a novice C++
programmer under my supervision), and have an executable size of about 380KB
(including the library).
- C++ Library
- Created a high-level library which allows the rapid development of Microsoft Windows
3.1 programs.
It is written in C++, with some 80x86 assembly.
It is customized for the needs of Tremplo Software programs which means an enhanced
user interface, a fixed-point class for reliable currency calculations, a
software-based 32-bit virtual memory system to keep program development costs low
while eliminating arbitrary limits, an array class supporting persistence and
polymorphism, B-Tree based database and iterator classes, as well as most of the
features from my BASIC library (see below).
- Client Program
- Created a networked scheduling & billing program with such things as user security
codes/clearances, bill/label/report/schedule generation, a variety of client
searches (including phonetic), user logging, and automatic retrieval/installation of
updated versions of the program.
It is written using MS-BASIC PDS 7.1 for MS-DOS and the BASIC library I created (see
below), which helps the executable size be about 500KB.
- BASIC Library
- Created a high-level library which allows the rapid development of MS-DOS programs.
It is written using MS-BASIC PDS 7.1, and provides a large variety of services such as
window management, menus, command bars, progress bars, list boxes, input fields, a
text editor, on-line help, manual generation with macro-expansion and formatting,
report generation functions, string and date manipulation including persistent
storage for variable length strings, encrypted licensing, and random-input (fuzz)
debugging.
- Program Maintenance
- I maintained the following programs:
- Single Entry Bookkeeper
- Payroll Summary
- W2-Generator
"Maintained" is probably too mild a word, however.
Before I started working there these programs would commonly crash and lose data, which
resulted in the programmers spending as much time doing tech-support as programming.
The programs used two-letter variable names, numbered GOSUBs, tons of global variables,
lots of copy-n-pasted code, and were basically a mess.
Also, the programmers did not use makefiles or keep logs of bugs, bug fixes, or other
program changes.
They were using MS-DOS to develop programs (i.e., no multi-tasking development
environment) and were using the QBASIC editor (which is basically identical to
MS-DOS EDIT).
In addition to fixing all of the above, I changed the programs to take advantage of the
BASIC library.
My work resulted in a near elimination of tech-support calls, about a 75% reduction in
source code (despite many added features), a much more consistent user interface,
and drastically reduced development times.
- I also maintained their network of about 25 computers.
Independent Work (1990 - 1993)
- The following projects were developed on and for the Tandy Color Computer 3 and were
marketed by Sundog Systems.
It is impossible for the reader to fully appreciate the skill and discipline necessary to
succeed at these projects without some knowledge of the limitations of that platform,
so I list some of them here:
- Based on the Motorola 6809 microprocessor running at up to 1.789 MHz.
- 128KB or 512KB of RAM.
- Manually paged memory (no automatic virtual address translation, no memory protection).
- No sound, video, or math coprocessor.
- Development environment:
- No hard-drive.
(They were available, but I never owned one.)
- Non-windowing/non-multitasking.
Had to quit the editor to assemble, quit the assembler to run.
- No facility for debugging besides running the program unchecked.
- No linker.
(The code had to link itself.)
- Variable and label names limited to 6 upper-case characters.
- Contras
- This action game was my first real experience with taking on someone else's mess.
What I got to start with was thousands of lines of assembly, an unknown number of bugs,
and a few after-the-fact hand-written pages of diagrams and notes describing what
the code did. Contras involved direct hardware access for memory paging, screen,
keyboard, and D/A converter (for sound and joysticks).
It also involved handling real-time interrupts, generating audio waveforms and
decompressing background graphics on-the-fly, and efficient handling of the
rendering and collision detection of a large number of software-driven sprites.
In the process of completing this project, I created a multi-level tile editor and
compression program (in today's terminology a "level editor").
- Photon
- For this action/strategy game I did everything from first-concept and design, to
artwork and music composition, to programming and testing.
The actual programming involved most of the things listed above for Contras.
Additionally, it involved on-the-fly decompression of audio (speech), and made use of
self-modifying code and a number of other tricks to keep the program efficient.
(1.789 MHz / 5 KHz (4-voice) audio only leaves 358 cycles per audio sample, most of
which has to be spent keeping up to 20+ software-driven sprites going at a 60 Hz
refresh rate and running the AI.)
In the process of completing this project, I created an integrated sprite
editor/compiler (which compiles sprites to 6809 assembly) and a music compiler
(which compiles music notation to a proprietary byte-code format).
- Photon was reviewed in the
May 1992 issue of Rainbow Magazine:
- "The cover of Photon's manual claims, 'It's too addictive,' and I have to agree.
It's the most addicting game I've played on the CoCo since Tetris"
- "Photon is a terrific arcade game in many ways -- but it requires more than
bang-bang reflexes to advance to higher levels.
It takes brain-power"
- "Photon has the mark of a classic game.
It's goal is easily understood, its controls are simple, but winning is devilishly
complex."
- "My recommendation: Addict yourself!"
- GrafExpress
- This graphics/sound system was my first independent, commercial product (produced under
the name Softronics Vanguard).
It offers a large collection of graphics and sound functions including windows,
scrolling, graphics primitives, sprites, collision detection, multiple resolutions,
double-buffered animation, and a music synthesizer.
It is the only program I know of which provides for 256 colors on the Color Computer 3.
The system is very fast (in some cases over 300 times faster than stock BASIC, and
that's not including BASIC's interpretation overhead) and quite compact
(only about 12KB).
GrafExpress also includes some sample applications: a couple of demos, miscellaneous
image editors, and a waveform editor.
- GrafExpress 1.0 was reviewed in the
August 1991 issue of Rainbow Magazine:
- "Softronics Vanguard has introduced its first offering to the CoCo community, and
what an offering it is."
- "I found it quite enjoyable and rewarding to use the GrafExpress system"
- GrafExpress 2.0 was reviewed in the
November 1992 issue of Rainbow Magazine:
- "As expected, the assembly-language programs performed faster than the basic
programs, but even the BASIC programs ran with incredible speed."
- "The documentation was far better than much of the documentation I've seen, and the
supporting programs and programming examples make it easy for the average
programmer to write his own applications."
Articles in Rainbow Magazine (1989 - 1992)
- I wrote and submitted for publication three articles to Rainbow Magazine (a
platform-specific magazine for the Tandy Color Computer).
All three were accepted and published:
- Add a Pause Switch -
June, 1992
- This article describes a simple method for adding a switch which would halt the
6809 processor, and thus any program.
- Data Transfer -
July, 1990
- This article describes and includes source for an action game.
The game was written in a combination of BASIC and 6809 assembly.
- Electro-Dominos -
June, 1989
- This article describes and includes source for a simple domino-simulation
program written in BASIC.
Viterbo University (1987 - 1991)
- Nutrition Program
- I wrote an MS-DOS program using GW BASIC for the Dietetics department.
It accepted information about an individual and their weight loss, muscle increase,
or body fat ratio goals and generated reports on required caloric intake,
time-frame, and expected results.
- Tutoring
- I tutored students in Modula 2 and Physics.
Academic Experiences
- Fuzz Revisited
(paper)
- Wrote a C++ program which interposes itself between an X application and the X
Windows server.
It is able to inject random but legal user input in order to test program
reliability.
It is also able to inject nonlegal input in order to test the server's reliability.
- PONI-Express
- Designed a RISC, non-pipelined, 16-bit, integer-only microprocessor at the gate
level using the Mentor Graphics tools.
It had a 14 gate-delay critical path, completed branch and memory instructions in 2
cycles, and all others in 1 cycle.
It was tested using QuickSim.
- Simple Compiler
- Built a simple compiler for a Pascal-like language targeting the MIPS R2000
architecture.
Everything from parser to code generator was written in C++.
Last Modified: April 30th, 2024
home: http://my.execpc.com/~steidl/
e-mail: steidl@execpc.com