About three weeks ago, I had the opportunity to sit down with Bill Bain of ScaleOut Software and the two Joes, Joe Cleaver and Joe Rubino, from Microsoft’s Financial Services Industry Evangelism team after I gave my presentation on distributed caches at Microsoft’s 6th Annual Financial Services Developer Conference. The two Joes recorded a podcast of our conversation.

Bill, Joe, and Joe, thanks for the opportunity to talk with you guys.

Dataflow is about creating a software architecture that models a problem on the functional relationship between variables rather than on the sequence of steps required to update those variables. It’s about shifting control of evaluation away from code you write toward code written by someone else. It’s about changing the timing of recalculation from recalculate now to recalculate when something has changed. Sure, it’s a distinction that may have more to do with emphasis and point of view than with paradigm, but it can be a liberating distinction for certain problems in financial modeling.

If you work in finance, chances are you may already be expert in today’s preeminent dataflow modeling language: Microsoft Excel. Excel is the undisputed workhorse of financial applications, taught in every business school, run on every desk, wired into the infrastructure of nearly every bank, fund, or exchange in existence. The reason for Excel’s singularity in the black hole of finance is its ability to emancipate modeling from code (and thus developers) and empower analysts and business types alike to create models as interactive documents. Make no mistake — writing workbooks is still very much software development. But Excel’s emphasis on data rather than code, relationships rather than instructions, is something that fits with the work this industry does and the people that do it.

Briefly, when you model in Excel, you specify a cell’s output by filling it with either a constant value or a function. Functions are written in a lightweight language that allows function arguments to be either constant values or references to another cell’s output. In the typical workbook, cells may reference cells that in turn reference other cells, and so on, resulting in an arbitrarily sophisticated model that can span multiple worksheets and workbooks. The point though is that, rather than specifying your model as a sequence of steps that get executed when you say go, here you describe your model’s core data relationships to Excel, and Excel figures out how and when it should be executed.
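To make the mechanics concrete, here is a minimal, purely illustrative sketch of that cell-and-formula idea in C# (the Cell class and its members are my own invention, not an Excel or .NET API): a cell holds either a constant or a formula over other cells, and it recalculates, and notifies its dependents, only when one of its inputs changes.

    using System;
    using System.Collections.Generic;

    // A toy dataflow cell: either a constant value or a formula over other cells.
    // When an input changes, dependents are recalculated automatically.
    class Cell
    {
        private readonly List<Cell> _dependents = new List<Cell>();
        private Func<double> _formula;
        private double _value;

        public double Value
        {
            get { return _value; }
            set { _formula = null; _value = value; NotifyDependents(); }
        }

        // Bind this cell to a formula over other cells, Excel-style.
        public void SetFormula(Func<double> formula, params Cell[] inputs)
        {
            _formula = formula;
            foreach (var input in inputs) input._dependents.Add(this);
            Recalculate();
        }

        private void Recalculate()
        {
            if (_formula != null) { _value = _formula(); NotifyDependents(); }
        }

        private void NotifyDependents()
        {
            foreach (var dependent in _dependents) dependent.Recalculate();
        }
    }

    class Program
    {
        static void Main()
        {
            var clock = new Cell();                           // a cell holding the simulation clock
            var price = new Cell();                           // a cell whose formula references the clock
            price.SetFormula(() => 100 + clock.Value, clock);

            clock.Value = 1;                                  // recalculate when something has changed
            Console.WriteLine(price.Value);                   // prints 101
        }
    }

Excel, of course, also handles cycle detection, recalculation order, and much more; the sketch only captures the shape of the contract: you declare relationships, and the engine decides how and when to evaluate them.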

Example: An Equities Market Simulation

Let’s say that we are writing a simulation for an equities (stock) market. Such a simulation could be used for testing a trading strategy or studying economic scenarios. The market comprises many equities, and each equity has many properties, some that change slowly over time (such as ticker symbol or inception date) and some that change frequently (such as last price or volume). Some properties may be functions of other properties of the same equity (such as high, low, or closing price), while others may be functions of properties on other equities (as with haircuts, derivatives, or baskets).
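As a rough picture of that data model, here is a hedged C# sketch (class and property names are illustrative assumptions): an equity with slowly changing reference data, frequently changing market data, and a derived property that depends on other properties of the same instrument.

    using System;

    // Illustrative equity data model for the simulation.
    class Equity
    {
        // Slowly changing properties
        public string Ticker { get; set; }
        public DateTime InceptionDate { get; set; }

        // Frequently changing properties
        public double LastPrice { get; private set; }
        public long Volume { get; set; }

        // Derived from other properties of the same equity
        public double HighPrice { get; private set; }

        public void UpdatePrice(double newPrice)
        {
            LastPrice = newPrice;
            if (newPrice > HighPrice) HighPrice = newPrice;
        }
    }

Baskets and derivatives would add one more wrinkle: properties that are functions of properties on other equities, which is exactly where the dataflow framing starts to pay off.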

As a starting point, we introduce a simulation clock. Each time the clock advances, the price of every equity gets updated. To update prices, we use a random walk driven by initial conditions (such as initial price S0, drift r, and volatility σ), a normally distributed random variable z, and a recurrence equation over n intervals, each of length t years:

S_{n} = S_{n-1} \cdot \exp(r t - 0.5 \sigma^2 t + \mathbf{z} \sigma \sqrt{t} )

Note: This equation provides a lognormal random walk [1,2], which means that instead of getting the next price by adding small random price changes to the previous price, we’re multiplying small random percentages against the previous price. This makes sense for things like prices since a) they can’t be negative, and b) the size of any price change is proportional to the magnitude of the current price. In other words, penny stocks tend to move up and down by fractions of a penny, while stocks trading at much higher prices tend to move up and down in dollars.

In Excel, you could model this market by plopping the value of the clock into a cell, setting up other cells to contain initial conditions, and then have a slew of other cells initialized with functions that reference the clock and initial conditions cells and that calculate a new price using the above equation for each virtual equity. And then hit F9.

But how would you write this in code? Would you just update the clock and then exhaustively recalculate all of the prices? If you had to incorporate equity derivatives or baskets, would your architecture break? How would you allow non-programming end-users to declaratively design their own simulation markets and the instruments within?
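For concreteness, here is a minimal sketch of that brute-force approach in C#: one step of the recurrence above applied to every price each time the clock advances. The Box-Muller draw for z and all of the names are illustrative assumptions, not part of any real framework.

    using System;

    // A naive, imperative version of the simulation: advance the clock and
    // exhaustively recompute every price with one step of the lognormal walk.
    class BruteForceMarket
    {
        private static readonly Random Rng = new Random();

        static void Main()
        {
            double r = 0.05, sigma = 0.2, t = 1.0 / 252;   // drift, volatility, one trading day in years
            double[] prices = { 10.0, 25.0, 100.0 };       // initial prices S0 for three equities

            for (int tick = 0; tick < 252; tick++)         // advance the simulation clock
                for (int i = 0; i < prices.Length; i++)
                    prices[i] = NextPrice(prices[i], r, sigma, t);

            foreach (double price in prices) Console.WriteLine(price);
        }

        // S(n) = S(n-1) * exp(r*t - 0.5*sigma^2*t + z*sigma*sqrt(t))
        static double NextPrice(double previous, double r, double sigma, double t)
        {
            double z = Gaussian();
            return previous * Math.Exp(r * t - 0.5 * sigma * sigma * t + z * sigma * Math.Sqrt(t));
        }

        // Box-Muller transform: two uniform draws in (0,1] yield one standard normal draw.
        static double Gaussian()
        {
            double u1 = 1.0 - Rng.NextDouble();
            double u2 = Rng.NextDouble();
            return Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
        }
    }

It works, but the sequencing is welded into the code: every new dependency, a basket, a derivative, a haircut, means revisiting the update loop by hand, which is precisely the brittleness the questions above are getting at.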

Recently, one of our financial services clients at Lab49 has been trying to solve a similar problem in .NET, and I had been suggesting to them that the problem is analogous to how Microsoft Windows Presentation Foundation (WPF) handles the flow of data from controller to model to view. Dependency properties, which form the basis of data binding in WPF applications, implement a dataflow model similar to Excel, and what I had in mind at first was a solution inspired by WPF. But the more I discussed this analogy with the client, the more I realized that we didn’t just have to use WPF as inspiration; we could actually use WPF.
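To give a flavor of those mechanics, here is a minimal sketch of a dependency property with a change callback (the SimulatedEquity class is hypothetical, and the code needs a reference to the WPF assemblies, in particular WindowsBase):

    using System;
    using System.Windows;

    // A simulated equity exposing LastPrice as a WPF dependency property.
    // The change callback fires whenever the value changes: recalculation
    // on change, just as in Excel.
    class SimulatedEquity : DependencyObject
    {
        public static readonly DependencyProperty LastPriceProperty =
            DependencyProperty.Register(
                "LastPrice",
                typeof(double),
                typeof(SimulatedEquity),
                new PropertyMetadata(0.0, OnLastPriceChanged));

        public double LastPrice
        {
            get { return (double)GetValue(LastPriceProperty); }
            set { SetValue(LastPriceProperty, value); }
        }

        private static void OnLastPriceChanged(DependencyObject d, DependencyPropertyChangedEventArgs e)
        {
            // A dependent calculation (a high/low, a derivative price, a basket value)
            // could be triggered here, or wired up declaratively through data binding.
            Console.WriteLine("LastPrice changed from {0} to {1}", e.OldValue, e.NewValue);
        }
    }

Register the property once and WPF takes over the bookkeeping: anything bound to LastPrice is reevaluated when the value changes, which is the recalculate-on-change behavior we have been after all along.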

In this series, I’ll dive further into creating the equities market simulation and look at how to use WPF data binding to create a dataflow implementation. Note that there are several considerations to this approach, and, under the category of just because you can doesn’t mean you should, we’ll evaluate whether or not this method has legs.

[to be continued]

The Marc Jacobs Utilization Meter has been pegged for at least two weeks now on a combination of client work, internal projects, recruiting, and writing (hence the appearance of my blog having fallen down a well). It’s great to be busy, but I hate seeing the blog go stale.

In any event, I had an article published in GRIDtoday this morning entitled, “Grid in Financial Services: Past, Present, and Future”. Derrick Harris, the editor of GRIDtoday, reached out for an article after reading my multi-part series on “High Performance Computing: A Customer’s Perspective”. A big thanks to Derrick for giving me this opportunity.

Excerpted from a paper I delivered on January 16, 2007 at the Microsoft High-Performance Computing in Financial Services event in New York.

In Closing

It’s a very exciting time to be a proponent of high-performance computing in finance. Right now, it’s still a rather rugged task, and evangelizing such rough solutions can sometimes result in sour impressions, but overall it’s getting easier all the time. With all the new products and vendors entering the market, I’m convinced we’ll be scaling out with ease in the coming years. But in the meantime, we have to be vigilant in ensuring that vendors understand our business and our developers, and that they bring to market the tools and guidance that allow us to keep prioritizing business first and technology second.

Excerpted from a paper I delivered on January 16, 2007 at the Microsoft High-Performance Computing in Financial Services event in New York.

Help Me Help You Help Me

To drive home the point, our trading and portfolio generation systems at Bridgewater have been parallelized and distributed for some time, based on a series of proprietary technologies that a) were not that great, b) lacked many features, and c) probably shouldn’t have been written in the first place. Along the way, we used DCOM, COM+, and .NET Remoting. We wrote custom job schedulers and custom deployment processes. We leveraged virtualization, disk imaging, multicast networks, message queues, even Microsoft Application Center. Each time, we managed to stack up the available Lego pieces and make a nice little tower out of it. But, typical of enterprise development projects that supply infrastructure rather than specific line-of-business value, they lacked for amenities. The APIs were never sufficiently developed or documented, the monitoring and administration tools often required black art skills, and the user interfaces, if present at all, bordered on sadistic.

Each time we revisited this situation, we knew that we shouldn’t be writing this stuff. We shouldn’t have had to. We knew that it wasn’t our core expertise and that we would never devote enough developer resources to give it professional polish. In reality, there are always just too many real projects to work on. The problem was that, until recently, there just weren’t any off-the-shelf packages for developing distributed applications on the Microsoft platform. For various reasons, we weren’t going to start invading our IT infrastructure with Linux just to use half-baked open-source solutions. So we rolled our own. Again. And again.

Then, a little over a year and a half ago, I read a news brief about Digipede Network in the now defunct Software Developer Magazine. It advertised a commercial grid computing solution built entirely on Microsoft .NET and running on Microsoft Windows. We downloaded the eval, read the APIs. After a brief meeting among lead engineers, we decided to do a test port, just a dip of the toe to see how much effort it would take to switch to a commercial solution.

Let me tell you. The whole port to Digipede, not just the acid test but the whole port, took one developer (me!) exactly three days from start to finish. After just an additional two weeks of procurement, deployment, and testing, we went live. And it has been working great.

That’s the kind of help we need. We need books and articles aimed at the broader market, not at the ACM or IEEE, that show just how easy it can be. We need our vendors to wrap up the hard stuff and leave us the samples, tools, and guidance so that we can just mark up our business logic and plug it into a high-performance computing infrastructure in a week.

Introducing the Windows Execution Foundation

Microsoft .NET Framework 3.0, despite the unfortunate naming confusion, brings with it a tantalizing mix of technologies that are just waiting to be composed into a high-performance computing framework for .NET. Building on the power of the Windows Communication Foundation and the Windows Workflow Foundation, we could solve the four big technical vacuums in financial high-performance computing:

  1. Job Deployment
  2. Job Security
  3. Pool Management
  4. Scalable I/O

Such a framework, let’s call it a Windows Execution Foundation, would have several features:

  1. Declarative Parallelism
  2. Distributed Concurrency and Coordination Constructs
  3. Distributed Shared Memory and Object Caches
  4. Lightweight File Swarming
  5. Lightweight Message Bus
  6. On-demand Pool Construction and Node Addressing

Armed with this kind of technology, the financial industry could focus on the business, not the technology. We could achieve high-performance computing without having to understand every relevant implementation detail. We could wrestle back control of our own scalability story from our IT departments and solve our scalability problems with software.
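Purely as a thought experiment, the job-submission surface of such a framework, if built on the Windows Communication Foundation, might look something like the sketch below. None of these types exist; the names are mine and only meant to suggest the kind of API shape I have in mind.

    using System.ServiceModel;

    // Hypothetical job-submission contract for a "Windows Execution Foundation".
    [ServiceContract]
    interface IJobService
    {
        // Submit a serialized unit of work to the compute pool; returns a job identifier.
        [OperationContract]
        string SubmitJob(byte[] serializedWorkItem);

        // Poll the pool for the status of a previously submitted job.
        [OperationContract]
        string GetJobStatus(string jobId);
    }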

[continue]