Final Projects
Late last year, I agreed to teach a master's course in data visualization at the City University of New York.
In a closely related story, my blogging dropped to zero over the last few months.
Anyways, the class was a great experience, and I think the students did an awesome job on their visualizations. You can see the full list of projects here.
Mapping the US Banking System with D3.js
I snuck in one last data project for 2013: a d3 interactive map of the US banking system. You can play with it here. Please do and let me know what you think!
I was really happy with how the project came out. I was also really happy with GitHub Pages, which I tried out for the first time with this project. If you ever work with git, I can't recommend it enough. I'm hoping to move more of my projects there in the future.
A Startup's Minimum Revenue Per Employee
In a couple of weeks I am going to be teaching a class on Data Visualization for Businesses (you should come!) and as part of the class prep I started thinking about key metrics that my students may want to visualize.
After weighing some of the options, I settled on Revenue Per Employee, which has been on my mind recently. I want to understand: what is the minimum revenue per employee that quickly growing companies can sustain?
I decided to put together a quick plot of a few key stats based on the 2013 Inc. 5000 List. You can play with the interactive chart here. In his excellent SaaStr blog, Jason Lemkin points out that everybody lies about business info, but agrees that this data is probably in the ballpark.
I took all of the data and threw it into a Google Motion Chart, which I find to be the easiest way to quickly inspect data across a bunch of different dimensions (though not as pretty or powerful as d3). I only included Advertising, Media, and Software companies, since I believe their major expenses are people and many of those start-ups are pulling from similar pools of people.
Google Charts lets you easily play with which data is displayed on each axis, so it's easy to look at the data a bunch of ways. First I looked at Revenue Per Employee vs. Number of Employees, but all this showed was that some of the companies are either really crushing it or may need to reconsider their reporting.
I played around with the data in a few other combinations (you should too here). There's plenty of interesting stuff in the data. For example, you can clearly see the $2 Million revenue cut-off to get into the list when you look at revenues per employee.
The graph of revenue per employee below shows that the average fast-growing company in these areas is generating $230k-$310k per employee in revenue. For fast-growing companies with over $2 million in revenue, that seems to be fairly consistent no matter how many employees the company has. But software companies (in yellow) lag far behind the other industries on this metric: they typically generate 25% less revenue per employee, and a few smaller ones are operating on under $100k per employee. I believe much of that is due to the increased interest in funding software companies over the last few years (allowing money-losing companies to grow), though some might also be due to lower operating costs.
This is actually the exact opposite of what I expected. I assumed that software companies would be more scalable and generate more revenue per employee, but the numbers don't bear this out.
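For anyone who wants to reproduce the metric outside of a motion chart, here is a rough Python sketch of the calculation. The company rows are invented for illustration and are not taken from the Inc. 5000 list, and the function names are hypothetical, so treat it as an outline rather than the code behind the chart.

```python
from statistics import mean

# Hypothetical rows: (company, industry, revenue in USD, employees).
# These figures are made up for illustration; they are NOT Inc. 5000 data.
companies = [
    ("AdCo", "Advertising", 12_000_000, 40),
    ("MediaCo", "Media", 6_000_000, 20),
    ("SoftCo A", "Software", 4_500_000, 20),
    ("SoftCo B", "Software", 2_100_000, 25),
]

def revenue_per_employee(revenue, employees):
    """Revenue divided by headcount; rejects a zero or negative headcount."""
    if employees <= 0:
        raise ValueError("employees must be positive")
    return revenue / employees

def mean_rpe_by_industry(rows):
    """Group companies by industry and average their revenue per employee."""
    grouped = {}
    for _name, industry, revenue, employees in rows:
        grouped.setdefault(industry, []).append(
            revenue_per_employee(revenue, employees)
        )
    return {industry: mean(values) for industry, values in grouped.items()}

result = mean_rpe_by_industry(companies)
for industry, rpe in sorted(result.items()):
    print(f"{industry}: ${rpe:,.0f} per employee")
```

Averaging per-company ratios (as here) weights every company equally; dividing total industry revenue by total industry headcount would weight large companies more heavily and can give a noticeably different number.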
You can see all the code here. Follow me on Twitter if you're interested in future projects/posts.
Chicago's Finances
I wrote a guest post on Chicago's finances for Lumesis (which I have worked with and am an investor in). You can find the whole post here.
A couple of additional thoughts on Chicago:
- Chicago's pension deficit grew by almost $3 billion last year. The whole city budget was under $6 billion!
- I have a hard time imagining the city ever declaring bankruptcy, but it is being run by a famously confrontational negotiator. I have to imagine that if Emanuel gets a second term and the city's finances continue to slide, it will be on the table.
- I compared Chicago to Detroit, but really Chicago's finances look more like Vallejo or Stockton in many respects, just on a different scale. Mismanagement and pensions are bigger issues than demographic change.
- In parsing data on Chicago, we found obvious errors in the financial reporting of several major American cities. This probably deserves its own post at some point.
- According to the city's presentation to investors on their pension funds, two of the four pension funds will run out of cash in the 2020s "If nothing changes and current law is followed". I have written before about zombie cities and I do fear that future for Chicago.
Why Detroit’s Bankruptcy is a Bigger Deal Than You Think
(disclosure- I am involved with and invested in Lumesis, which sells Muni credit & compliance software)
So Detroit filed for bankruptcy last week.
If you live in the US and don’t work in municipal finance, you probably barely registered this news.
And you probably think that it’s not that surprising that something bad happened in Detroit. I mean, during the recession a house there cost less than a car. So we all saw this coming, right?
Well, yes and no.
Yes, everyone in the muni market knew that Detroit was a poor credit. It’s not a surprise- that’s not why it’s a big deal.
The default is by far the largest ever for a municipality, as the Times can show you:
$18 billion (or more) is nothing to sneeze at, but in the context of the $600 billion pension underfunding and the $3 trillion muni market, it’s minor. This will hurt a few investors badly (including my old coworkers at Ambac), but that’s not why this is a big deal.
Detroit’s (modest) pensions put about $3 billion of payments to retirees at risk. It is very possible that thousands of pension-holders will lose a substantial part of their future income and be plunged into poverty. It’s horrible, but that’s not why this is a big deal.
Detroit’s bankruptcy filing is a big deal because it marks a fundamental change in the relationship between cities and those who finance their projects.
Imagine being a city (or county, or state) administrator. You collect taxes and pay your police and firefighters. You have inherited a pension fund from the previous administrator, and you are legally obligated to fund it. And you raise money from the bond market (and others) to finance projects like building a new school or a new road, and you pay the bondholders back over a decade or more.
Imagine that, suddenly, you lose a bunch of revenue. Maybe there’s an exodus of people, reducing your income and sales tax collections. Maybe real estate price drops decimate your property tax base. Maybe the Federal Government reduced payments you depended on. Or maybe you (or the last administrator) spent recklessly or were corrupt or just invested in projects that didn’t work out. The reason doesn’t matter. You’re broke.
You have a bunch of options. You can ask for aid, though that gives others a lot of leverage over you. You can cut or freeze spending, though that makes your city less pleasant to live in, further decreasing property value. You can ask your workers to share the pain, but there’s only so far that will take you. And remember, those union members pay taxes and vote. You can stop funding your pension, but there’s only so long you can live on your credit.
And then there are your bondholders.
"I used to think if there was reincarnation, I wanted to come back as the president or the pope or a .400 baseball hitter. But now I want to come back as the bond market. You can intimidate everybody." –James Carville
Bondholders, in some respects, are the easiest stakeholders to push the losses to when things go bad. They have fewer votes than the union and won’t move out of your city if you default. The bondholders in your town may even be insured by a large company such as Ambac or Assured, so you might lose zero votes for declaring bankruptcy and sticking these insurers with the bill.
The bond market knows this and has a bunch of tools to handle this: securing bonds, reserve funds, credit support and seniority, to name a few. But the most powerful tool is known as market discipline.
Basically, you can think of market discipline in the bond market as tit-for-tat: if you don’t pay me now, I won’t lend you money in the future. Game theory 101. But it goes way beyond that.
Muni lenders and insurers have traditionally been able to put pressure on localities through their states. If Michigan lets one city treat its bondholders poorly, bondholders will demand higher interest from every other city in Michigan looking to raise money, as well as the State itself. This way bondholders can leverage their status as a major source of the entire State’s financing.
This is what Carville is talking about when he says that the bond market is intimidating. States that have a reputation for being less friendly to bondholders (like California) pay higher interest rates than those with better reputations. This is also part of the reason that issuers look at different yield curves for each State. States are anxious to keep finance costs low for themselves and their communities, so they are usually ready to play ball with financers’ demands.
And THIS is what is important about Detroit. The fact the city doesn’t want to pay is no surprise. But the role of the State and Governor has been.
“Realistically, if you step back, if you were lending to the city of Detroit in the last few years, didn’t you understand there were major issues and problems?... Look at the yields they’re obtaining compared to other bonds. They were getting a premium.” - Michigan Governor Rick Snyder on Face the Nation
This. A prominent Republican Governor (with a finance background, no less), who is considered a 2016 presidential candidate, is going on national TV and basically saying that the bondholders had it coming. The State’s emergency manager has floated a plan that would give some bondholders about 10c on the dollar.
This is not playing ball. And this shift in attitude has been the primary concern of investors who have been shunning the market.
'When faced with large revenue shortfalls, communities that have all of their bonds insured will be more prone to develop “solutions” less favorable to bondholders than those communities that have uninsured bonds held by local banks and residents. Losses in the tax-exempt arena, when they come, are also likely to be highly correlated among issuers. If a few communities stiff their creditors and get away with it, the chance that others will follow in their footsteps will grow. What mayor or city council is going to choose pain to local citizens in the form of major tax increases over pain to a far-away bond insurer?' –Warren Buffett, 2009
This is bad news for everyone. Even if Detroit’s filing is rejected by the courts (a possibility) thousands of city and state administrators dealing with similar issues on a smaller scale will be following this. At the very least, trust between cities and lenders will fray, requiring investors to spend more time and money analyzing and monitoring their credits, which will push up costs. And higher borrowing costs can’t be good for already stressed budgets.
And if Detroit emerges in 2-3 years in better shape (and really, how much worse shape could it be in?) how strong will the temptation be to restructure, or at least use the threat of restructuring to obtain concessions from lenders?
If many more municipalities default in the future, it is likely to impact you in some way, whether your pension or a family member’s job or the services that your community is able to provide.
This won’t come soon. Nothing in the Muni market moves fast, and smarter people than I have looked bad predicting downturns. But I think that this, truly, is the opening round in a new era for the market.
Analyzing ‘Analyzing the Analyzers’: an Analysis
Analyzing the Analyzers is a recently published report by Harlan Harris, Sean Patrick Murphy and Marck Vaisman, documenting the results of a 2012 survey of ‘several hundred’ data scientists.

The report is free and just 25 pages of text, plus an appendix- you should read it.
The authors’ central contention is that there is not one set of skills that organizations should look for in a data scientist. Instead, there are four distinct skill groupings that you will find in the ‘data science’ world:
- Data Businesspeople: managers primarily focused on their organization and the bottom line
- Data Creatives: hackers who feel comfortable with the entire data pipeline, from extraction to presentation
- Data Developers: back-end and infrastructure engineers primarily working on data extraction, storage, and scale issues
- Data Researchers: academics, usually with a strong background in statistics
(if you are interested in the skill sets considered and how you would be categorized, you can check out the authors’ quick survey).
The report goes on to cover career path issues and the distribution of skill that data-scientists have. Again, if you’re interested in these things, I recommend you spend the 30-40 minutes and read the report.
Unfortunately, it seems like the authors didn’t make any data available for us to play with or check their work (come on guys, know your audience!), but I certainly agree with their main point: ‘Data Scientist’, as a title, isn’t particularly useful.
@revodavid the main problem with DS title is that EVERYBODY wants to use it. it's so diluted that i put "engineer" on my biz card instead.
— Adam Laiacano (@adamlaiacano) May 16, 2013
@adamlaiacano @revodavid Yep, that is why I don't use the title anymore. Lost its meaning. My engineering title is more specific.
— Ryan Rosario (@DataJunkie) May 16, 2013
Kaggle now has 100K data scientists, but what's a data scientist? http://t.co/tRGpd4Q97Z
— GigaOM (@gigaom) July 11, 2013
So are these four groups the right way to think about data science? One thing that jumped out at me is that they seem to match up with my experience with the data pipeline.
- Development is needed for Extraction and Storage of the data
- Research is needed for finding patterns in the data
- Analysis and Presentation of the data are needed to drive decision-making
- Coordination and Productization are needed to actually extract value from all this work
Though I think about the roles differently, these four sections of the data-to-product pipeline seem to match the authors' categories. And from the beginning, it has been clear that 'data-science-whatever-that-is' requires a bunch of talents: see DJ Patil's thoughts about building a data science team. Since DJ Patil allegedly came up with the term 'data scientist' in the first place, this seems appropriate to me, though I doubt we'll see people moving en masse away from the 'Data Science' moniker.
For the record, I felt the need when I started this blog to stress that I'm not a scientist.
Moving from tumblr to Wordpress
Just a note to the readers of this blog: you may have noticed that something is different. Like, everything is different...
This past weekend I migrated this blog from tumblr to wordpress. I did this because tumblr asked everyone with their own url to change their 'A record' or CNAME. This was fine, so I did as instructed.
Unfortunately, the change broke the display of everything else on my domain, particularly my projects, which I am quite fond of. I'm sure this is my fault and not tumblr's, but it was a good forcing function for me to make some long-needed changes to the site anyways.
The transition was shockingly easy and can be done on wordpress.com with this handy importer. I need a little more control over my site, so I'm actually using wordpress.org, but the same functionality is available via a plugin.
Don't get me wrong - I still love tumblr and have a bunch of sites on there. But for the way I write, (wordy, lots of links, generally obnoxious) I feel more at home on wordpress: the experience seems smoother.
The knock, unfortunately, is that wordpress doesn't look as good as tumblr, generally. I'm playing with the site's theme/css etc until I get it looking the way I like it, but that will probably take some time. I also need to fix a bunch of broken links, but, eh... not high priority.
The Dashboard Lifecycle
I once asked my brother, who studied large organizations, what was more effective- the hierarchical, top-down organization of, say, Apple, or the distributed decision-making of, say, Urban Outfitters.
My brother said “both”.
Apparently, the best way to capture the benefits of hierarchies (order, coordination) and delegated authority (reaction speed, creativity) was to cycle between the two. There was generally no single best system for any one organization, not even for very large organizations with stable missions.
Change was best. Even though it imposes high switching costs, change is best.
That conversation occurred to me this week as I looked at the dashboard that I provide my team, updating them on the state of our business. My dashboards generally shift from being very simple to being much more complex, until we all agree it’s time for a different look and we burn them down again.
Now I’m starting over again with a new dashboard, and I’ve realized that this process has repeated itself enough that I really recognize a cycle, which I’m calling, super-creatively, ‘The Dashboard Life-Cycle’. It goes like this:
Creation: It starts very simply. “What are our three top priorities or KPIs and how are we measuring them?” A first cut of a dashboard might be as simple as 3-5 numbers, tracked over time. People look at it and say “That will do for now, I guess”. I always think of the dashing demo as a fine example of a dashboard in this stage.

This stage usually lasts only a month or two. It doesn’t take too long before your data consumers start asking things like “Is that number good or bad?” and “How does that compare to last year/our goals/our competitors/industry average/whatever?” So then you move to the next stage…
Context: Contextual dashboards are heavy on the line-charts and bar-charts, depending on whether you’re comparing your product to last week or to the competition. The iconic Google Analytics chart has about 130 data points on it and is a pretty good example of a contextual dashboard.

People look at it and say “That will do for now, I guess”. But this level of analysis can take an organization a long way, illustrating trends or market strengths clearly. However, your data consumers realize that these charts aren’t really giving them any idea why numbers are up or down, or why you’re gaining or losing market-share. “What’s driving this change?” and “How can I tell which segments we are lagging with?” are typical questions after looking at a board like this. So we move to stage 3…
Category: In category or segment dashboards, the constituent parts of your numbers take center stage. Here you’re no longer looking at the company’s top priority numbers from the first stage unless you’re summing up the different groups somewhere. Mostly you’re looking at different groups to see if there are any outliers. I usually see 250 or so data points on these dashboards. To give a sense of complexity, this dashboard is a pretty good example.

This is where you start getting to the point of needing to sub-divide the dashboard into different views. People can only interpret so much data at once. Some data consumers start to tune the dashboard out at this point, instead just coming up and asking about info that’s readily available on the board. Others are really unsure about what is included in one group versus another (e.g. “When does someone move from lead to contact?” “How do targets fall out of our funnel?”). People who aren’t really paying attention look at it and say “That will do for now, I guess”.
At this point you’re probably better off just burning down to a simpler board. But occasionally, you’ll have data consumers that want the highs and lows highlighted, or some additional number added in. So you move to the final stage…
Chaos: This is the deluxe, extra-coffee-holder-and-seat-warmer version of a board. Here’s one with well over 400 data points on it. Multiple displays of the same data points, sparkline abuse, and tiny type are all signs of a dashboard deep in the Chaos stage.

Only the most dedicated consumer can make it through this. You’re frequently asked about what is going on, what the numbers are, and “Is there a summary or something?”
So you start over. And the cycle begins again.
Can I fix this? Should I fix this?
For a long time, I was frustrated with the amount of time I spent conveying metrics to my team. I have wondered if there was a better way to approach these displays (and I’m sure there is one). But I’ve come to appreciate that cycling my dashboards, while costly in terms of time, is actually valuable in terms of my team’s understanding of what’s going on, for a few reasons:
Avoid Habituation: If there’s an email I get every week, or a monitor I pass every day, after a while I start to tune it out. By shaking things up, I can keep things a little bit fresher and get a lot more engagement from the team.
Easier to Try New Things: With the expectation that the dashboard will be replaced every few months, there is lots of opportunity to look at new indicators and see how they trend over time. I often think that KPI selection is heavily path dependent, so multiple opportunities to look at trends gives you a much better chance of really finding some insight.
Less Sensitive to Criticism: This may be unique to me, but I know that after I work on a project for a while, I get cranky when people nit-pick or ask me to add things. By making each iteration of my dashboard lower-stakes, it’s a lot easier for me to openly and honestly assess what is working and what is not, and then make changes accordingly.
And after all, change is best.
The Real Issue With Excel

It has been over a week since we discovered that Carmen Reinhart and Kenneth Rogoff made some unusual modeling choices and an error in key calculations in their seminal paper Growth in a Time of Debt.
So far, Microsoft Excel has taken a lot of the criticism, with the damning image above used to prove the point. The press has had no problem coming up with other examples of Excel errors resulting in serious costs. Though I do think Ars Technica takes the cake with this image:
 
 
I’ve spent plenty of time working on financial models in Excel. In fact, I helped write a book about it. And yes, before you ask, there are (a small number of) errors in the book. My co-author keeps a list of errata for all of his books accessible online, so feel free to check it out and make fun.
However, I now work primarily in R and Python, using each for both accounting and forecasting purposes (as well as a bunch of other things, of course). I’ve even worked with SAS on one painful occasion. So I have at least a little perspective to opine on Excel and alternatives.
First, let me say that Excel, as a piece of software, is not the culprit here. My nerd cred be damned, finding the average of a small data set like this is as appropriate a use for Excel as you’re going to find, especially where the data may be updated asynchronously from multiple different sources by hand, and new rows may need to be added piecemeal. Plus, it’s not like other approaches to data work are error-free. Excel has parts that I hate (type errors, propagating rounding, MERGED CELLS… the horror…) but mis-identifying your sum ranges is more sloppiness than anything.
I do appreciate Cathy O’Neil’s commentary on the situation, especially her call that this information should be openly available. And I think that’s reasonable in an academic setting. Much like Paul Graham’s view that death is the default for startups, I think errors are the default for analysis. It takes a fair bit of effort to make sure that analysis is done right, and generally the best way to do that is to have lots of eyes on the process. And Excel’s biggest drawback is that it doesn’t make it easy for non-authors to see the process.
That is the real difference between Excel and programmatic approaches to analysis. Visually, Excel is what I would call ‘state-oriented’ (I’m sure a case can be made that I’m using ‘state’ incorrectly in a CS sense, but bear with me). Look at the spreadsheet above: only by clicking on the cell do you actually see what operations are being performed. Mostly, you see data. Compare this to what’s on my screen if I did a similar analysis in R:
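# Read the CSV; check.names = FALSE keeps the column name "90.or.above" unchanged
myData <- read.csv(file = "oddly_weighted_GDP_growth_numbers.csv", check.names = FALSE)
# Backticks are required because the column name starts with a digit
output <- mean(myData$`90.or.above`)
print(output)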
Is this cleaner? Yes. Can I see the calculations? Sure! But what do I know about the data here? What’s in front of me most of the time is my code. I would call this ‘function-oriented’ (even though my pseudo-code isn’t ‘functional’ in the least). I certainly can view the state of the data in R, just like I can see the functionality in Excel by opening the editor or clicking on a cell, but that’s not playing to each approach’s strength.
I try to be conscious of this difference when I’m working on analysis projects. In general, the less good I feel about my data, the more likely I am to spend time in Excel. I feel closer to the numbers in that program. On the other hand, if I am more concerned about the steps I’m taking in the analysis, I’ll use R or Python to iterate through different types of analysis and evaluate what’s working and where my errors are.
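To make the sum-range slip concrete, here is a small Python sketch of what a mis-identified range does to an average. The growth numbers are invented for illustration and are not the Reinhart and Rogoff data, and `average_range` is a hypothetical helper that only mimics a spreadsheet AVERAGE over a block of rows.

```python
from statistics import mean

# Hypothetical growth rates (percent) for seven countries, one per row.
# Invented for illustration; NOT the figures from the actual paper.
growth = [2.0, 3.0, -1.0, 1.0, 4.0, 3.0, 2.0]

def average_range(values, first_row, last_row):
    """Mimic a spreadsheet AVERAGE over 1-indexed, inclusive rows."""
    return mean(values[first_row - 1:last_row])

intended = average_range(growth, 1, 7)   # the range covers every row
truncated = average_range(growth, 1, 5)  # the range stops two rows short

print(f"intended:  {intended:.2f}")
print(f"truncated: {truncated:.2f}")
```

Nothing in either result hints that rows were left out, and that is the point: the slip is only visible if someone looks at the range itself, whether it lives in a cell formula or in a slice in code.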
$1 Trillion of Student Loan Debt
Today I learned that student loans recently surpassed credit-cards to become the largest non-mortgage source of debt for US consumers. Not only that, in the next quarter or two, total outstanding debt will ominously hit 1 Trillion Dollars.

The remarkable thing about this is not just how big student loan debt is, but how much it has grown in the last few years.

(chart shamelessly stolen from alphaville)
Two things jump out at me:
1- while the other types of debt seem to be cyclical, student loan balances seem to have climbed at a pretty steady pace, in good times and bad.
2- in 8 years, student loans have almost quadrupled in size!
Obviously, some part of this is demographic: there have been more kids graduating as the ‘millennials’ grow up. But that change has been only about 11% over the past 7 years.

(data from the NCES, compiled by me)
The NY Fed has a helpful page on this as well.
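To put those two growth figures on the same footing, here is a quick Python sketch that converts each into a constant yearly rate. It assumes "almost quadrupled" means a multiple of 4 (so the loan figure is an upper bound) and reads the 11% as growth in the number of graduates; the function name is hypothetical.

```python
def annualized_growth(total_multiple, years):
    """Constant yearly growth rate that compounds to total_multiple over years."""
    return total_multiple ** (1 / years) - 1

# Assumption: "almost quadrupled in 8 years" is treated as exactly 4x.
loan_growth = annualized_growth(4.0, 8)
# Assumption: "about 11% over the past 7 years" is treated as exactly 1.11x.
graduate_growth = annualized_growth(1.11, 7)

print(f"student loans: {loan_growth:.1%} per year")
print(f"graduates:     {graduate_growth:.1%} per year")
```

Under those assumptions the loan balance compounds at roughly 19% a year against roughly 1.5% a year for graduates, which is why demographics can only explain a small part of the increase.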
So how much is $1 Trillion? It’s:
- over $3,000 per person in the US.
- about $25,000 per student with debt
- larger than the GDP of Indonesia
- larger than the entire construction industry’s impact on US GDP.
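The first two bullets can be sanity-checked by running the division backwards. This Python sketch uses only the figures quoted above and works out the head counts they imply; the function name is hypothetical, and the results are implied values, not official statistics.

```python
TOTAL_DEBT = 1_000_000_000_000  # $1 trillion, the headline figure

def implied_count(total, amount_each):
    """How many people the total covers at a given amount per person."""
    return total / amount_each

# "over $3,000 per person in the US" implies a population below this count.
implied_population = implied_count(TOTAL_DEBT, 3_000)
# "about $25,000 per student with debt" implies roughly this many borrowers.
implied_borrowers = implied_count(TOTAL_DEBT, 25_000)

print(f"implied US population: under {implied_population / 1e6:.0f} million")
print(f"implied borrowers:     about {implied_borrowers / 1e6:.0f} million")
```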
To be fair, I think much (certainly not all) student debt is warranted. Though being a lawyer isn’t what it once was, and there are smart people calling education a bubble, I know I got a lot out of university and I imagine most other people who went did as well.



