Author Archives: Eric Novik

About Eric Novik

Software Entrepreneur, Options Trader, Skeptic, but not necessarily in that order.

To plot or to ggplot, that is not the question

Producing informative and aesthetically pleasing quantitative visualizations is hard work. Any tool or library that helps me with this task is worth considering. Since I do most of my work in R, I have a choice of plot, the default plotting function; lattice, a more powerful package; and ggplot, which is based on the Grammar of Graphics.

There is usually a tradeoff between the expressiveness of the grammar and the learning curve necessary to master it. I have recently invested 3 days of my life learning the ins and outs of ggplot and I have to say that it has been most rewarding.

The fundamental difference between plot and ggplot is that in plot you manipulate graphical elements directly using predefined functions, whereas in ggplot you build the plot one layer at a time and can supply your own functions. You can do quite a bit (but not everything) with a function called qplot, which hides the layering from the user and works similarly to plot. That makes qplot exactly where you want to start when upgrading from plot.

To demonstrate, the following R code partly visualizes the famous iris dataset, which contains Sepal and Petal measurements of three species of Iris flower, using the built-in plot function.

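# Tighten the margins and tick marks; las=1 draws axis labels horizontally
par(mar=c(3,3,2,1), mgp=c(2,.7,0), tck=-.012, las=1)
# Colour points by species, offset by 1 so the palette starts at red
with(iris, plot(Sepal.Length, Sepal.Width, col=as.numeric(Species)+1, pch=20))
# plot() draws no legend, so build one by hand with matching colours
lbs <- levels(iris$Species)
legend('topright', legend=lbs,
       col=2:4, cex=0.7, pch=20, box.lwd=0.5, pt.cex=0.6)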

One of the problems with plot is that the default plotting options are poorly chosen, so the first line of code fixed the margins, tick marks, and the orientation of the y axis tick labels.  The parameter col=as.numeric(Species) + 1 fixes the color offset at Red as opposed to the default Black.  Type palette() at the R prompt to see the default color vector.

The last complication is that plot does not draw the legend for you; it must be specified by hand.  And so, if you run the above code in R, you should get the following output.

It took a little bit of work, but the output looks pretty good.  Following is the equivalent task using ggplot’s qplot function.

library(ggplot2)
qplot(Sepal.Length, Sepal.Width, data = iris, colour = Species, xlim=c(4,8))

As you can see, ggplot chooses a lot more sensible defaults and in this particular case, the interface for specifying the intent of the user is very simple and intuitive.

A final word of caution. Just as a skier who sticks to blue and green slopes is in danger of never making it out of intermediate hell, the qplot user who never ventures further will never truly master the grammar of graphics. For those who dare to use the much more expressive ggplot(…) function, the rewards are well worth the effort.
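As a rough sketch of what that next step looks like, here is one way the qplot call above might be written with the layered interface. The choice of geom_point() and xlim() is my reading of what qplot does for two numeric variables, so treat it as an approximation and not an exact translation. It is also worth knowing that recent ggplot2 releases mark qplot as deprecated, which is one more reason to learn the layered form.

```r
library(ggplot2)

# Layered counterpart of the qplot() call above: the data and aesthetic
# mappings go into ggplot(), and each visual element is added as a layer.
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point() +   # the scatter layer
  xlim(4, 8)       # same x-axis limits as the qplot() example

print(p)
```

Because the plot is an ordinary object, further layers (a smoother, facets, a different theme) can be added to p with the same + operator.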

Here are some of the ggplot references that I found valuable.


A Better Way to Learn Applied Statistics, Got Zat? (Part 2)

Earning a PhD for Dummies

In the second semester of grad school, I remember sitting in a Statistical Inference class watching a very Russian sounding instructor fast forward through an overhead projected PDF document filled with numbered equations and occasionally making comments like: “Vell, ve take zis eqazion on ze top and ve substitude it on ze butom, and zen it verk out.  Do you see zat?”  I did not see zat.  I don’t think many people saw zat.

In case I come off as an intolerant immigrant hater, let me assure you that as an immigrant from the former Soviet bloc, I have all due respect for the very bright Russian and non-Russian scientists who came to the United States to seek intellectual and religious freedoms. But this post is not about immigration, which incidentally is in need of serious reform. This is about an important subject, which on average is not being taught very well.

This is hardly news, but many courses in Statistics are being taught by very talented (and sometimes not so talented) Statisticians who have no aptitude for or interest in teaching. But poor instructors are not the only problem. These courses are part of an institution, an institution which is no longer in the business of providing education. Universities predominantly sell accreditation to students, and research to (mostly) the federal government. While I believe that government sponsored research should be a foundation of a modern society, it does not have to be delivered within the confines of a teaching institution. And a university diploma (i.e. accreditation), even from a top school, is at best a proxy for your knowledge and capabilities. For example, if you are a software engineer, Stack Overflow and GitHub provide much more direct evidence of your abilities.

With the cost of higher education skyrocketing, it is reasonable to ask if the traditional university education is still relevant. I am not sure about Medicine, but in Statistics the answer is a resounding ‘No.’ Unless you want to be a Professor. But chances are you will not be a professor, even if you get the coveted PhD.

So for all of you aspiring Data Geeks, I put together a table outlining Online Classes, Books, and Community and Q&A Sites that completely bypass the traditional channels. And if you really want to go to school, most Universities will allow you to audit classes, so that is always an option. Got Zat?

Programming
  • Online Classes: Computer Science Courses at Udacity. Currently Introduction to Computer Science, Logic and Discrete Mathematics (great for preparation for Probability), Programming Languages, Design of Computer Programs, and Algorithms. For a highly interactive experience try Codecademy.
  • Books: How to Think Like a Computer Scientist (Allen B. Downey); Code Complete (Steve McConnell)
  • Community / Q&A: Stack Overflow

Foundational Math
  • Online Classes: Single Variable Calculus Course on Coursera (they are adding others; check that site often); Khan Academy Linear Algebra Series; Khan Academy Calculus Series (including multivariate); Gilbert Strang’s Linear Algebra Course
  • Books: Intro to Linear Algebra (Gilbert Strang); Calculus, an Intuitive and Physical Approach (Morris Kline)
  • Community / Q&A: Math Overflow

Intro to Probability and Statistics
  • Online Classes: Statistics One from Coursera (this course includes an Introduction to R language); Introduction to Statistics from Udacity
  • Books: Stats: Data and Models (Richard De Veaux)
  • Community / Q&A: Cross Validated, which tends to be more advanced

Probability and Statistical Theory
  • Online Classes: It is very lonely here…
  • Books: Introduction to Probability Models (Sheldon Ross); Statistical Inference (Casella and Berger)
  • Community / Q&A: Cross Validated

Applied and Computational Statistics
  • Online Classes: Machine Learning from Coursera; Statistics and Data Analysis curriculum from Coursera
  • Books: Statistical Sleuth (Ramsey and Schafer); Data Analysis Using Regression and Multilevel Models (Gelman); Pattern Recognition and Machine Learning (Chris Bishop); Elements of Statistical Learning (Hastie, Tibshirani, Friedman)
  • Community / Q&A: Stack Overflow, especially under the R tag; New York Open Statistical Programming Meetup (try searching Meetups in your city)

Bayesian Statistics
  • Online Classes: Not to my knowledge, but check the above mentioned sites.
  • Books: Bayesian Data Analysis (Gelman); Doing Bayesian Data Analysis (Kruschke)
  • Community / Q&A: I don’t know of any specialized sites for this. (Opportunity?)

Don’t Judge Your Ecuadorian Chef or How to Learn Statistics (Part 1)

There is a small Ecuadorian cafe right outside of my apartment building. The beauty of living in Astoria, Queens is that delicious, reasonably priced food is just a few blocks away. The owner, Claudio, is a good buddy of mine and cooks up amazing dishes like Pernil, Ceviche Camaron, and other Ecuadorian goodness. It is usually quite warm inside, with South American music playing in the background and the clanking of dishes and cutlery filling in the rest of the spectrum.

Beef Stew, Ecuadorian Style - $8 with Soup

Claudio’s brother Edgar helps run the place by chopping vegetables and occasionally stepping up to the stove.  The place is small, usually a bit crowded and you get this feeling that these guys earn every dime the hard way.  Always hustling, delivering lunches and dinners, but never forgetting to smile.  Recently I had the following conversation with Edgar.

“Enrique, cómo está, my friend?”, says Edgar as I enter the store.
(Everyone there calls me Enrique.  I think I might have said my name was Enrique, when we first met.  They also sometimes call me Che Guevara. So it goes.)

“Bien, gracias, ¿y tú?”
(This is the extent of my Spanish)

“Very good; jost getting the lonches ready”
(He knows that was the extent of my Spanish.)

Then after a while.

“Enrique, I want it to ask you so’mthin. You mens’ioned that you studied stateestics. I want to ask you, what should I do to get into this field.”

I feel embarrassed. I never thought that he could be even remotely interested in the subject. So, like a total ass, I proceed with a question of my own.

“Oh yeah?  Have you taken any math classes?”, I ask sardonically.

“Well,  jes.  I have taken Calculus I through III, Probability Theeory, Leenear Algebra, Differensial Equasions, Real Analysis I, Real Analysis II, Heestory of Mathemathics, and lets see.  I have also taken Leenear Regression.  They let me take graduate classes at Hunter, since I have done so well in my undergraduate ones”, nonchalantly answers Edgar, while wiping off the butcher’s knife.

OK, now I feel like a special kind of ass – a dumb-ass.

“Dude, you are much more qualified than I ever was.  What can I do for you?”, I said trying to recover.
(I call a lot of people ‘dude’.  Even my daughter.  She once said: “I am not a dude daddy, I am a girl!”. “Whatever, dude”, came the reply.)

“Well, I was looking for graduate schools, but they are so espensive. I don’t think I can afford it.”

“You don’t need to go to graduate school, Edgar.  I mean you can, but I think that you can get the skills necessary without it and perhaps learn even more.”

When I came home, I thought about how one would go about learning the subject without relying on the traditional venues.  What if the person does not have Edgar’s math background?

The Teaser
The first question one usually asks when confronted with a new subject is why?  The answer is that statistics will make you a better human being, no more, no less.  As a side effect, it will help you make better decisions when confronted with uncertainty, which is pretty much always.

Here is a teaser.  Suppose you are given two sequences of Boy/Girl births: BGGBGG and BBBBBG.  Which one is more likely to occur?  If you think that the first one is more likely, you are in the majority.  You are also wrong, but not to worry.  A casual investigation of probabilities will clear this right up.
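Here is one way to run that casual investigation in R. It is only a sketch, and it assumes that every birth is independent and that boys and girls are equally likely; the function name sequence_prob is made up for this example.

```r
# Assumption: each birth is independently a boy or a girl with probability 1/2.
p_boy <- 0.5

# Probability of one specific sequence, e.g. "BGGBGG": multiply the
# probability of each birth in order.
sequence_prob <- function(s, p = p_boy) {
  births <- strsplit(s, "")[[1]]
  prod(ifelse(births == "B", p, 1 - p))
}

sequence_prob("BGGBGG")  # (1/2)^6
sequence_prob("BBBBBG")  # (1/2)^6 as well

# What differs is the number of orderings that share the same boy/girl split:
choose(6, 2)  # orderings with 2 boys and 4 girls
choose(6, 5)  # orderings with 5 boys and 1 girl
```

Under these assumptions the two exact sequences are equally likely. The first one only feels more likely because there are more orderings that look like it.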

Since I am in a baby mood, here is another. A small rural hospital and a large urban hospital each reported a boys-to-girls ratio of 0.70 on a given day. Do you believe this? Where is it more likely to occur?

Study after study shows that we do not have an intrinsic intuition for probabilities. Nature, it seems, did not prepare us to live in a highly uncertain world. All of us need some help with this, some more than others. For example, if the right answer to this problem is immediately obvious to you, you are either a highly evolved human being or an alien robot. In either case, I envy you!
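For those of us who are neither, the hospital question can be checked with a short binomial calculation. The birth counts below (17 and 170 per day) are hypothetical numbers picked only so that a ratio of exactly 0.70 is possible, and the sketch again assumes independent births with an even chance of a boy.

```r
# Hypothetical daily birth counts (not from the post), chosen so that a
# boys-to-girls ratio of exactly 0.70 is possible: 7/10 and 70/100.
n_small <- 17
n_large <- 170

# Assumption: each birth is independently a boy with probability 1/2.
# P(ratio <= 0.70) is P(boys <= 7) out of 17, or P(boys <= 70) out of 170.
p_small <- pbinom(7,  size = n_small, prob = 0.5)
p_large <- pbinom(70, size = n_large, prob = 0.5)

c(small = p_small, large = p_large)
```

With these made-up counts the lopsided day is far more probable at the small hospital, because averages over few births swing much more widely than averages over many.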

The Tools
While discussing the subject, I purposely avoid terms like Machine Learning, Data Mining, and Data Science.  A lot has been written about the subtle differences.  If you care about that, here is one discussion thread.  Also, see a related post from Drew Conway here.

There are three main threads I see that are needed to be an applied statistician. (Theoretical statistics is largely an academic discipline, and if you are interested in it and live in New York take a look here.)

  • PROGRAMMING: Statistics is becoming more and more computational. On the periphery, it is important to be able to prepare your data for analysis. At the core, the old but rapidly evolving discipline of Bayesian Statistics seldom offers analytical solutions, so the ability to program simulations becomes very important. Programming your own optimization algorithms is also an instructive and rewarding exercise. Do you need to be a computer scientist? No, but it helps to think like one. If you have never programmed, read the first chapter and ignore the references to Python. More on programming languages later.
  • FOUNDATIONAL MATH: Despite Edgar’s love for Mathematics, a basic grasp of multivariate calculus, probability, and linear algebra is enough to acquire the skills and intuition necessary for applied statistics. If you want to dig a little deeper, Measure Theory, a branch of Real Analysis, helped solidify the foundations of probability. It is good to know it exists, but I would trust that it works, and put off the investigation till much later. If all this sounds daunting, don’t be alarmed. Anyone can learn enough math to understand what is going on. (I did, and I am no mathematician.)
  • STATISTICS: This will largely depend on the type of work you enjoy doing, but the core will include Linear Regression (like predicting income based on age, sex, education, and so on) and Generalized Linear Models (like classifying a tumor as benign or malignant, based on its size). I would also recommend Time Series (used in climate models, economics and finance, among others). Understanding of Markov Chains and Markov Chain Monte Carlo (MCMC) will be helpful for Bayesian Statistics. This list goes on and on, but even if you master the first sentence in this paragraph, you are way ahead of the curve.
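To make the first two items under STATISTICS a little more concrete, here is a minimal sketch in R. I have no income or tumor data to hand, so it borrows the mtcars dataset that ships with R as a stand-in; the variables are therefore illustrative only.

```r
# Linear regression: predict fuel economy (mpg) from weight (wt).
fit_lm <- lm(mpg ~ wt, data = mtcars)
coef(fit_lm)

# Generalized linear model (logistic regression): model the probability
# that a car has a manual transmission (am == 1) from its weight.
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)
coef(fit_glm)

# Predicted probability of a manual transmission for a 3,000 lb car
# (wt is recorded in units of 1,000 lb).
predict(fit_glm, newdata = data.frame(wt = 3), type = "response")
```

The two calls look almost identical, which is the point: a Generalized Linear Model is the same regression idea applied to an outcome (here yes/no) that a straight line cannot describe directly.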

If the above sounds like a bunch of gibberish, do not be deterred.  Because, so far I have not told you much.  I listed and named the areas of focus, but I have not described the method.  Where should I acquire these skills, using what mode of interaction, and in what sequence?

To give you a preview, I do not like the highly theoretical and sequential method used in most graduate Statistics programs. In the next post, I will describe an alternative.

In the meantime, I am off to Claudio’s cafe.   I hear he has lentil soup and goat stew cooking.  (While I eat, Claudio is trying to convince me to follow Jesus.  After sampling the food, I am starting to believe…)

I only had two states of mind

My Story of Risk(tail)

I am lying on the floor of my office with my fingers lodged into my hair and wondering what is the right amount of force to apply so it would hurt like hell, but not result in bleeding.  The last bit reminds me that I am not completely crazy.  Not yet.

It is 2005 and I have a comfortable job running an IT group at a small consulting company in New Jersey.  I have a nice office with a view of trees, grass, birds, and other suburban awfulness.  The job pays very well, my daughter was just born and my son is turning 3.  In short, I think that I am dying a slow painful death.  I am so miserable, I want to scream obscenities at innocent bystanders.  Or rip my hair out.  Or whatever else one does to feel alive again.

So naturally, I decided to do something risky. Why not try my hand at trading derivatives? And so I did. I lost quite a bit of cash right away, which was the best thing that could have happened to me. There was a small chance that I could have made a lot of money (I was long out-of-the-money calls), and had it happened, it would probably have ruined me. Trading (options) was a great way to learn about myself, but that is another story. Anyway, options trading was fun, but it was not enough. Not even nearly enough.

In the end I left it all.  The job, the safety, the trees and the grass, my wife and my children, all my savings, and some of my sanity.  I think I kept my humanity.  You be the judge.  It all started when my wife and I moved to New York in 2007.  Then came my fascination with statistics, start up(s), divorce, misery, more misery, and redemption.  This is a story of risk through the eyes of one risk taker.  Have a seat.  It might get bumpy.

One of our first Risktail design sessions (Columbia SSW, 9th floor)

When we moved to New York, I was no longer working for the man.  I rented a small office on 28th and Broadway, joined a California startup (remotely), traded more actively, and took a couple of statistics classes.  To overcome the loneliness of solitary trading, I started New York Options Traders, a group that we run to this day.

After we had 200 or so members on meetup.com, Chris from the products group at International Securities Exchange contacted me and offered to sponsor the group by providing meeting space, food and drinks, and occasional speakers. I was shocked. Why would real professionals want to waste their time with me and my tiny group? (Thanks Chris and Mark – you helped more than you know.)

The side effects of community building are just as rewarding as main effects.  I have met great friends and colleagues in the process.  Most of all, I was happy when people told me how much they liked our meetups.

While trading and interacting with other traders, I realized that the trading business is about managing uncertainty over the long run, not nailing individual trades. I had some intuition for it, but I wanted to learn the subject more fully. So on the recommendation of a good friend and Statistics professor, Robert Stinerock, I enrolled in the statistics department at Columbia and started dusting off my old calculus textbooks. (Turns out this is completely unnecessary for trading, but that is also another story.)

Columbia University is an interesting place.  I have met some of the smartest and some of the ‘not so smartest’ people there, and sometimes that would be the same person!  I had a passion for statistics, and I enjoyed taking classes.  I also liked the youthful feel of the place and I used it often as an escape from the daily routine and from my deteriorating marriage.

I ran into Alex in one of my classes.   He was working for Morgan Stanley at the time and was on the way to becoming a rock star exotics trader.  As I later found out, Alex had an offer to go to Moscow and help a senior trader build an exotics desk for the Russian Central Bank.  This job would have most likely made Alex some serious coin.  Alex never went to Moscow.  Instead he agreed to work with me.  This is how Risktail was born.

Left: Alex working out of a kitchen, our first 'office'. Right: In the Statistics lounge at Columbia

Meeting guys like Alex was the best part of going to Columbia. The side effect turned out to be better than the main effect. Go figure.

The Risktail idea was to help options traders navigate the stormy waters of listed options. When you buy a stock or a future, your position PnL goes up or down by a fixed amount in proportion to the size of your position (with no leverage, if the stock is up a dollar, your PnL is up a dollar). It is for this reason that stocks are sometimes called linear instruments. Not so with options. For a dollar rise in the underlying stock, your PnL may go up 10 or down 0.50. The next one-dollar increment can leave you down 20, and so on. In this sense, options have non-linear payoffs, despite the sophomoric hockey stick diagrams. This non-linearity gives options all their interesting, albeit unexpected, properties.
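A small sketch can make the linear versus non-linear distinction visible. It uses the textbook Black-Scholes formula for a single long call purely as an illustration; this is not how Risktail worked, and the strike, rate, volatility, and time to expiry below are all made-up values.

```r
# Black-Scholes value of a European call (textbook formula, used here only
# to illustrate curvature). All parameter values below are made up.
bs_call <- function(S, K, r, sigma, tte) {
  d1 <- (log(S / K) + (r + sigma^2 / 2) * tte) / (sigma * sqrt(tte))
  d2 <- d1 - sigma * sqrt(tte)
  S * pnorm(d1) - K * exp(-r * tte) * pnorm(d2)
}

S <- 95:105                      # underlying price, one dollar at a time
stock_pnl  <- S - 100            # long one share bought at 100
option_val <- bs_call(S, K = 100, r = 0.01, sigma = 0.2, tte = 30 / 365)

# PnL change for each successive one-dollar move in the underlying
diff(stock_pnl)    # always 1: linear
diff(option_val)   # grows as the stock rises: non-linear
```

A single long call is the mildest case. Combine several options with different strikes and expiries and the per-dollar change can flip sign from one increment to the next, which is the behaviour described above.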

My idea, derived from personal experience and from running NYOT, was that instead of relying on model-heavy analysis of options portfolios, we would let the analyst run through historical options prices subject to specified constraints and thereby provide a real feel for the evolution of the portfolio. This is called backtesting, and as it turns out, options backtesting is a really hard problem.

Complex analytics require some careful architectural consideration.  A more rigorous design process would have revealed the complexity of the problem and would have helped reduce the scope of the initial product. We particularly underestimated the amount of effort required to obtain and clean options data.

Alex and I started working in the statistics lounge on the 9th floor of Columbia’s SSW building. At that point, I left the California start-up and started focusing full time on Risktail. Alex developed a basic prototype and I got us a demo with an up-and-coming retail options brokerage firm. (Thanks Aric. To this day I believe that if we had a working system, you would have worked with us.) The demo went great and we got a lot of encouragement, so naturally I felt that we were onto something. In retrospect we might have had something, but the optimism was premature.

Despite what I originally thought, most people wanted us to succeed. Turns out, when I was passionate about the project, people generally gave me the benefit of the doubt, went along with my story, and encouraged me to continue.

In the winter of 2010 while reading Fred Wilson’s blog, I found out about the NYCSeed start-up accelerator program run by Owen Davis and immediately applied.   In the spring, we were invited to present to the program mentors in the New York office of DogPatch Labs, which is run by the wonderful Peter Flint.  In a separate email, I notified Owen that I was recently served the divorce papers and felt that even though I could manage it while building Risktail (I was wrong), it was an important enough event that I needed to disclose it.

Most of my friends told me I should not disclose this.  They will not work with you, many of them said.  I said bullshit.  If they don’t want to work with me because of this, fine; but they still need to know.  

We continued building the product and in April we received a form letter from Owen saying that we did not make the cut. Not deterred, we continued building. Then sometime in July, when I was sitting in Uris Hall in Columbia’s Business School Library, my cell phone rang.

“Hello, is this Eric?”

“Yes”

“This is Owen Davis”

“hmmm, Hello”

“I was wondering where you guys are with Risktail?”

“Still building. Had a few early demos with potential customers.”

“Would you be interested in joining NYCSeed this summer?”

I swallowed hard.  I knew that Owen’s group would bring us the type of experience, opportunities, and connections, that would be hard to obtain on our own.  This was potentially life changing.  Not wanting to sound like a wimp, I said:

“Let me talk it over with my co-founder.  I will get back to you tomorrow.”

“Sure thing”.  Click.

I called Alex.  He almost fell off whatever he was sitting on.  I believe his response was something along the lines of “Fawk Yeaaah….”

But there was a little problem. My funds were rapidly deteriorating. I had no other source of income and my divorce was escalating. I was considering taking a side consulting gig. The $20,000 gift from NYCSeed would not be enough to sustain me, my family, and Alex. (It was not really a gift; it paid for 5% common in Risktail and implicit follow-on rights if desired, but given the expected value, this was a far out-of-the-money call option; sort of like my first trade, and it turned out to be just as successful.)

So without thinking about it much, Alex, who still had some of the Morgan Stanley money, immediately offered to lend me the contents of his bank account. Floored by his generosity, and afraid of all possible repercussions (he is Russian, you know), I hesitated. We talked it over while taking a walk up 5th Avenue near Columbus Circle and sipping on iced tea. It was a hot summer day.

“Dude, we should definitely do this”, said Alex.

“I don’t know man.  I mean, I appreciate all the help, but this is a lot of money”

“No worries. We will make it back.”

I think he felt that this was his shot at greatness as well, and he did not want to pass it up.  He also knew that I would make it right one way or the other.  Hesitant, but optimistic, I decided to take the plunge.  And so, stage 2 of our project had begun.

At betaworks. (They have really nice paintings)

NYCSeed secured office space at NYU business school and four other companies joined us there.  The schedule was packed with mentor meetings, dinners, legal presentations, and other things that I no longer remember.  But all I kept thinking was how am I going to close the first deal, the first beta, the first anything.

Matt Gorin from Contour Venture Partners, who was our primary mentor, kept pushing us to come up with a stronger value proposition, the secret sauce, in the VC parlance. I too felt that the product was missing something, but I couldn’t quite put my finger on it. I decided to call Tom Sosnoff, who was still running Thinkorswim at the time. When I showed him the app, he was not impressed. Nice, but get back to me when it works. That’s what it was missing. It did not work!

This should have been a big clue for us. We (Alex and Luka, our Serbian genius coder) spent over 6 months of intense coding on a product that still did not work. A small team with limited resources cannot afford to go 6 months without having something in front of customers. The corollary was that the product was too big for our team and for our resources (i.e. too many features that I mistakenly thought we desperately needed).

A more evolved design, in the basement of NYU Stern School of Business

When we started, my intent was to sell to retail brokers, thinking that it was the shortest path to customers. My experience in B2B told me that if I nurtured a few early enterprise accounts, they would take me from concept to product and pay for it. But retail brokers are not your typical enterprise customer, in the sense that their key products are NOT facing internal users who are used to being abused by awful software. Their users are retail traders who are accustomed to beautiful software that works almost without a hitch. It was unreasonable for me to think that they would take an alpha version, no matter how much they liked us.

In fairness, Roger Ehrenberg from IA Ventures told us early on that selling to retail brokers is a giant pain in the arse.  He was not giving us a typical VC blurb from 30,000 feet.  Roger did a few deals in the space and had some experience in that market.  Stubborn as I was, I ignored his advice and we kept working the feature set.

By the time the demo day rolled around we had a few more leads and expressed interest from another large retail brokerage, but still no working product.  We wanted to raise a $700K Seed round, but the reception was lukewarm.  Show us some traction was the most common response.  This should have been a clue also.  At seed stage, it is possible to raise a round without a ton of traction (it is hard, but possible, and traction does not have to be measured in $$).  What a lot of them were saying was that they did not believe this was a compelling opportunity and there were no signs of market demand.  I hate to say it, but they were right. (Yes, VCs can be right sometimes.)

In retrospect, we were going into a very crowded space and not disrupting it!  Existing retail brokers would have all the leverage, even if we built something truly remarkable.

By the time I was busy running around New York meeting with potential investors, my divorce was heading for trial and I had fired my attorneys. I missed one month of temporary support payments to my wife and was facing possible jail time for contempt of court. Emotionally, I started to lose it. I remember standing in front of the judge and being told that it was in my best interest to hire a lawyer. I had no money left to do that. I could not even pay my previous lawyers’ legal bill. I walked out of the courthouse, looked up at the sky, and for the first time saw complete darkness all around. It was high noon and sunny … I think.

Right before a preliminary hearing in Family Court.

We did not raise the Seed round. The sad thing is that one of our potential clients agreed to pay us to complete the product. It would not have been a lot, but had we gotten it earlier, we might have taken it. At this point, both of my co-founders had worked without pay for almost a year. The jig was up. It was too little too late. We threw in the towel. I threw in the towel.

Hindsight is 20-20, as they say. I believe that first and foremost my mistake was strategic. In particular, we should not have chosen to pursue the indirect sales channel, given our very limited resources. Even if we had been able to get installed, we still would not have known if we had a viable business, as we had no access to price tolerance data and our ability to scale beyond a few initial customers was largely unknown. Instead, we should have focused on building a small strategy tester for one asset class, nailed it, and gone direct. We still might have missed, but at least we would have gotten more diversified feedback from actual users.

I know now that focusing on a sales channel from day one is a mistake.  You are once removed from the end user.  And once is too many in this case.

In the end we failed, and failed slowly, but it was not for the lack of trying. The guys who worked with me on this gave it their all, and then some. Luka, who is currently a lead developer at Real Direct, created our beautiful front end without knowing anything about UX and Adobe Flex. Alex did not know Python when he started and built a super fast analytics engine. Mladen, our statistician in residence, spent many hours coaching us on what the data presentation should look like, and was the only one who was always telling me that I didn’t know what the fawk I was doing. Boy, was he right. Thanks, dude, I needed that!

Last version of the Risktail UI

Overall, the experience was transformational and in a sense life changing.  If I have any regrets, it is only that I have not delivered for my team, for our mentors and investors, and for our families who had not seen much of us as a result.  All the glory belongs to them; the mistakes are mine, and mine alone.

Even though I came out bruised, I am not deterred.  Now more than ever, I am optimistic about the rate at which technology is transforming industries from education, to government, and yes to finance.  Once again, I am starting to write down ideas, and I am looking up in the sky.  It is so bright outside; I may need sunglasses.  It is now close to midnight.  I am ready to work.