15 Principles for Data Scientists

•June 2, 2013 • 9 Comments

I have developed 15 principles for my daily work as a data scientist. These are the principles that I personally follow:

1- Do not lie with data and do not bullshit: Be honest and frank about empirical evidence. And most importantly, do not lie to yourself with data.

2- Build everlasting tools and share them with others: Spend a portion of your daily work building tools that make someone’s life easier. We are freaking humans, we are supposed to be tool builders!

3- Educate yourself continuously: you are a scientist, for Buddha’s sake. Read hardcore math and stats from graduate-level textbooks. Never settle for shitty explanations of a method that you receive from a coworker in the hallway. Learn fundamentals and you can do magic. Read recent papers, go to conferences, publish, and review papers. There is no shortcut for this.

4- Sharpen your skills: learn one language well so you can be called a pro. Learn other languages well enough to be able to communicate with others. Don’t forget, SQL is like English: it is spoken by every moron on this planet, but if you master it you can write beautiful poetry. Learn a compiled language, an interpreted language, and R. Or just learn R! It is ugly but it will give you an edge. And fuck Matlab, you are not an undergrad anymore. Learn Unix, even if you use Windows; learn sed and grep and all that. You can do wonders with bash and PowerShell. If you want, learn how to use Hadoop too, but know that it is a crappy system.

5- Know that a data scientist has one purpose in life, “Kick ass and amaze people”: Do one thing every day related to this.

6- Challenge yourself often by presenting your work to others. Do not be scared of a few douchebags who might criticize your work. Crush them; if you let yourself be scared of every cockroach, you would never even walk!

7- Be generous with your knowledge and don’t be afraid to ask questions: some people are insecure about their knowledge and do not share it. Forgive them, but do not be one of them.

8- Develop some ideas first and then listen to other people’s insights; utilize what they know about the domain but do not restrict yourself to that: If they could solve the problem with what they knew, they wouldn’t come to you for a solution.

9- Hang out with people, talk to them, and learn how you can be useful in their projects and how their work can benefit yours.

10- Build impressive and interactive user interfaces for your bland code: Code is our language; let your code shine with a UI.

11- Use visualization efficiently and avoid hard-to-understand graphs: The only purpose of visualization is to make data understandable, not confusing.

12- Learn about new technologies and strive to understand the fundamentals of classic technologies

13- Over promise and over deliver: this is how genius people work. Do not be scared of proposing creative ideas. Have you heard of “under promise and over deliver”? That’s how shitty cubicle rats work. Don’t be one of them.

14- Stay creative and focused: you can win with creativity and focus (caffeine can help here, but do not overdo it).

15- Be positive, work hard, and if anyone wants to stop you, just crush them.

Learning C++11

•June 22, 2012 • Leave a Comment

(I am still updating this post; I am learning C++11 and this is my live blog post. There might be typos and bugs.)

I have been hearing about modern C++ and I feel that it has a future. I mean, C++ is already strong; it has survived for 30 years. But when I code in C++ I need to use my brain cycles for stupid things. I need to think about data structures carefully, whereas when I am working in something like Python I am free to get creative; I do not need to care about low-level stuff. I can just code and hope that my code is going to run relatively fast.

The new C++ seems to be pretty amazing. I am not just talking about the “auto” keyword and type inference. Lambda functions seem to be very useful, and C++ now has things that were available in Java and C# from the beginning (remember for_each?). What I like to say is that C++ is now a …

In this thread I will collect documents that can teach you and myself about the new C++. I will be a little careless about copyrights but nothing on this page is mine and I have just compiled it.

1- Watch the video “Not your father’s C++” by our man Herb Sutter.

2- Read Herb Sutter’s blog post “Elements of Modern C++ Style”.

3- Lambda Expressions

An example of a lambda function is on this page:

#include <iostream>

using namespace std;

int main()
{
    auto func = [] () { cout << "Hello world"; };
    func(); // now call the function
    return 0;
}
I use Visual Studio 2010 and lambda functions already work in it. You may want to add #include “StdAfx.h” at the top of your source file for your code to compile.

New Features

The new features are summarized here.

This is another good example of lambda expressions: “Using Lambda Expressions for Shorter, More Readable C++ Code”.

The Wikipedia article on C++11 is relatively useful; the only problem is that it does not highlight which features are already implemented in VS 2010 or 2011. For example, constant expressions are not yet supported in Visual Studio (at least not in the 2010 version that I use).

TODO: add this on Modern C++ http://msdn.microsoft.com/en-us/library/hh279654(v=vs.110).aspx



My notes from the “Learning to Learn” talk by Stanford’s Benjamin Van Roy

•April 17, 2012 • Leave a Comment
Benjamin Van Roy

Below are my notes from a talk entitled “Learning to Learn” by Benjamin Van Roy. I am reading some of the references and will add more to this document to make it readable for others soon.

Abstract: I will discuss the importance of learning to learn, and how this is a distinctive element of reinforcement learning relative to other areas of statistical learning. I will then survey some relevant research and discuss recent work with Zheng Wen on an algorithm that efficiently learns to learn (and learns) in dynamic systems with arbitrarily large state spaces by combining optimistic exploration and value function generalization.


Bio: Benjamin Van Roy is broadly interested in the formulation and analysis of mathematical models that address problems in information technology, business, and public policy. He is a Professor of Management Science and Engineering and Electrical Engineering, and, by courtesy, Computer Science, at Stanford University. He has held visiting positions as the Wolfgang and Helga Gaul Visiting Professor at the University of Karlsruhe and as the Chin Sophonpanich Foundation Professor of Banking and Finance at Chulalongkorn University. He has served on the editorial boards of Discrete Event Dynamic Systems, Machine Learning, Mathematics of Operations Research, and Operations Research, for which he is currently the Financial Engineering Area Editor. He has served as a researcher, advisor, founder, or director, for several technology companies. He received the SB (1993) in Computer Science and Engineering and the SM (1995) and PhD (1998) in Electrical Engineering and Computer Science, all from the Massachusetts Institute of Technology.



Reinforcement Learning Models in Literature

  • Myopic Learning
  • Dithering??
  • Reinforcement Learning

What is this “multi-armed bandit” I keep hearing about in every online-ad talk? I should learn it. Watch this video lecture later.
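While I am at it, here is a minimal sketch of the simplest bandit algorithm I know of, epsilon-greedy (my own toy example with made-up click-through rates, nothing from the talk):

```python
import random

def epsilon_greedy(arm_means, epsilon=0.1, pulls=100000, seed=0):
    """Simulate epsilon-greedy on a Bernoulli bandit with the given arm means."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms    # how many times each arm was pulled
    values = [0.0] * n_arms  # running mean reward of each arm
    total = 0.0
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore a random arm
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit best estimate
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total += reward
    return counts, total / pulls

# three hypothetical ad click-through rates; the agent should home in on the 10% arm
counts, avg = epsilon_greedy([0.02, 0.05, 0.10])
```

After enough pulls, almost all the non-exploration traffic goes to the best arm, which is the whole point of these online-ad talks.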

Literature on efficient reinforcement learning:

  1. Kearns-Singh 2002
    1. Devise a plan to learn soon if possible
    2. Otherwise plan to exploit
  2. Brafman-Tennenholtz 2002
    1. Optimistic exploration
  3. Kearns-Koller 1999
  4. Abbasi-Yadkori-Szepesvari 2011


A Mandelbrot Fractal in Python

•April 12, 2012 • Leave a Comment

I coded up this Mandelbrot fractal in Python while watching TV. Not sure if it is helpful for anybody, but you may want to take a look at it and enjoy the beauty of chaotic dynamical systems. The code is posted here and below too. Here is a fascinating high quality version of it.

# Mandelbrot set
# By Mark Alen
# linux_jvm@yahoo.com
# April 2012
from PIL import Image, ImageDraw
from math import log

white = (255, 255, 255)
width = 5000
height = width
image1 = Image.new("RGB", (width, height), white)
draw = ImageDraw.Draw(image1)

# escape-time algorithm: http://en.wikipedia.org/wiki/Mandelbrot_set
for xpix in range(width):
    for ypix in range(height):
        # map the pixel to the complex plane: x in [-2.5, 1], y in [-1, 1]
        x0 = (xpix * 1.0 / width * 3.5) - 2.5
        y0 = (ypix * 1.0 / height * 2) - 1
        x = 0
        y = 0
        iteration = 0
        max_iteration = 1000
        while (x * x + y * y) < 4 and iteration < max_iteration:
            xtemp = x * x - y * y + x0
            y = 2 * x * y + y0
            x = xtemp
            iteration = iteration + 1
        # log-scaled grayscale shading by escape time
        mycol = int(255.0 * (1 - log(iteration / 1000.0 * 255 + 1) / log(256)))
        color = (255 - mycol, 255 - mycol, 255 - mycol)
        if iteration == max_iteration:
            color = white
        draw.point((xpix, ypix), color)
    if (xpix * 100) % width == 0:
        print xpix * 100.0 / width, "%"
filename = "mandel.png"
image1.save(filename)
print "Done!"

Dear Coursera and Udacity! Don’t congratulate yourself too much

•March 27, 2012 • 21 Comments

So a couple of smart professors from Stanford have started two separate startups and have put their courses on the web, and the world is going nuts. Everyone is talking about them and they are busy congratulating themselves on this amazing accomplishment. Every major paper is writing about how these professors are revolutionizing education and how amazing these websites are (see the Wired or NY Times articles). Do not get me wrong, the work that Udacity and Coursera are doing is far superior to MIT’s course dump (OCW), but is it really what we were all envisioning for online education? I mean come on! We did all that research on distance learning, collaborative whiteboards, online labs, and we ended up with these low quality Khan Academy-style videos? Are you kidding me? We have a whole freaking academic community specifically around engineering education; they even publish scientific journals!

I have a lot of respect for the professor who started this, Sebastian Thrun, whose wonderful book on probabilistic robotics was my bible for a long time, but here is what I think they are doing wrong.

Both classrooms (Udacity and Coursera) are too similar to regular classrooms.

Just like a classroom, the course starts on specific dates and goes on for 7 weeks. Students need to stick to deadlines, do quizzes, submit homework, and finish on time. There is no flexibility, there is no customization; you will take the same course as the next guy over the internet with a completely different background. What if I want to learn a topic in a year instead of 7 weeks? What if I want to learn it in 10 years? For example, I was busy last week and was catching up on my emails today. One of the emails was from Coursera announcing that their algorithms course went live last week, but when I went to sign up today it told me that I cannot enroll now! My question is: why? Seriously, why can’t I start whenever I want and finish whenever I want? This is the same thing that I hated about my old-fashioned offline university!

In fact, Professor Thrun has published his vision for online education as a university that has the following elements:

…Nine essential components of a university education: admissions, lectures, peer interaction, professor interaction, problem-solving, assignments, exams, deadlines, and certification.

Are you kidding me? I know a system that was around way before the web and had the same elements; its name is “College”. So all you have done is take the same lectures and make videos out of them, and that has become the revolution in education that we were all dreaming about?

This is what I think: People are taking these courses because for many it is the only way to learn about interesting topics like robotics or machine learning. Take a video of a Stanford professor talking about a hot topic and people will eat that up. That does not necessarily mean that we have unlocked the power of online education. I also doubt it will give any value to a Stanford student who can sit in the real classroom.

To me this is aiming low, it is giving up on our dreams,  it really is a failure.

5 Reasons Why We Live In A Freaking Exciting Time

•February 18, 2012 • 2 Comments

Our friend and comedy extraordinaire Louis C.K. has a short clip called “Everything is amazing and nobody is happy”. In the clip he basically asks why we don’t get excited about the simple things that technology has brought us. Watch it below.

I keep telling my girlfriend that we live in an exciting and extraordinary time. And she keeps telling me that my dad was probably saying the same thing 30 years ago. The thing she doesn’t know is that my dad was just a kid on the streets of a poor third world country, struggling to finish his PhD without Wikipedia and the Internet (even though Al Gore had invented the Internet a couple of years earlier 🙂). But now we have all these exciting things at our fingertips. With Wikipedia I am a hundred times smarter than my dad at the smartest point in his life.

Below, I’ll give you 5 reasons from my everyday life, and hopefully I can convince you that we should all go “Oh my God, this life is fucking awesome”. I will just give you examples about education, as we rarely get excited about anything else.

Reason 1. TV is changing and educational videos are becoming cooler to watch: We have a big TV at home, but we do not have cable. It is hooked to a small computer with which we can watch all sorts of shit, from the education channels on YouTube to some of the TED videos that do not suck major buttocks. By the way, have you watched “Justice with Michael Sandel”? Highly recommended.

Reason 2. We can now take our education on the road with us: Even I, as a poor grad student, can now afford to have a couple of wireless devices. I can load PDF files onto my tablet and read them on the road. Through our university we get full access to books from O’Reilly and Springer. I can read them on the train to work without killing a lot of trees.

John Canny

John Canny is a professor at Berkeley. He is mainly known for the fact that he unlocked the secret to days longer than 24 hours. When he was at MIT he used only 24 hours of his 70-hour days to invent the Canny edge detector and used the rest to date women. After he got married and settled down he started utilizing the rest of his days to invent things in HCI, machine learning, and God knows how many unrelated fields like healthcare and psychology. To this day I am still wondering how he can make all these contributions. Many scientists go into the severest depression as soon as they realize that they can never be John Canny.

Reason 3. High quality education is becoming accessible to every fucking idiot: this is the most exciting thing for me. I have probably paid thousands of dollars to UC Berkeley. And I am kind of happy about it. That gives me the privilege of sitting in classes taught by professors like John Canny and Michael Jordan. But I can also stay home and get the same education from the Internet. I can watch Andrew Ng’s machine learning class or Berkeley’s scalable machine learning course without paying a penny (well, I pay 40 bucks to those bloodsuckers at AT&T for the internet, but that’s another story). Also, if you like these things I highly recommend following John Canny’s Behavioral Data Mining course.

Reason 4. Science is actually being used now: When you read something about technology you know that it is being used right now. I was reading a paper about the All-Reduce method and it was great to know that my homeboy, John Langford, has used it in his Vowpal Wabbit, and Yahoo is using it for spam filtering.

Reason 5. Science experiments are becoming inexpensive: I have an awfully incapable laptop, but for a very cheap price I can now get a large cluster crunching numbers for me. My friends at Udacity are now teaching high school kids how to build a kick-ass search engine using commodity computers.

I do not know about you, but every day when I wake up I feel freaking blessed that I live in this time. As Salman Khan says, if these things do not make you excited, you might have no soul 🙂

Disclaimer: I’m indebted to professor Canny immensely. And I have a lot of respect for the man.  This is just a joke.

Notes from Neel Sundaresan’s keynote speech at RecSys 2011

•October 26, 2011 • Leave a Comment
Neel Sundaresan

He started by stating that he won’t have any Greek symbols in the talk.

Arch West was the inventor of Doritos and David Pace was the inventor of Pace sauce. What they did was notice that they could sell more if they advertised the two products together. There is a lesson about cross-selling and recommender systems that we can learn from this story.

eBay started when Pierre Omidyar wanted to sell his broken laser pointer. He listed the laser pointer online for 99 cents and it finally sold for 14 dollars. He wondered whether the person who bought it knew that it was broken. The buyer responded, “Yes, I am a collector of broken laser pointers.”

Why do people buy something? It is hard to say. Some people buy stories! Remember the toast that sold for 27K? That was the cheapest marketing campaign a casino ever had. One man’s trash is another man’s treasure.

The long tail in eBay’s context means that most people sell very few items, and most of eBay’s revenue comes from these people (i.e. the mean is way larger than the median).
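To see what “the mean is way larger than the median” looks like, here is a toy, completely made-up seller-revenue distribution (my numbers, not eBay’s):

```python
# made-up annual sales per seller: a huge head of tiny sellers plus a few giants
sales = sorted([10] * 900 + [100] * 90 + [100000] * 10)

mean = sum(sales) / float(len(sales))
median = sales[len(sales) // 2]
# mean is 1018.0 while the median seller made only 10
```

A handful of giant sellers drags the mean two orders of magnitude above the median; that is the signature of a heavily skewed distribution.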

Users are constantly running experiments to maximize their revenue. They are constantly testing whether free shipping can sell more; different selling strategies are being tested by users at any time on eBay.

This causes an interesting behavior: if you promote a user’s product on the homepage, they may increase their price! There is an interesting dynamic between the seller and the platform (eBay).

One of the problems that companies like eBay have is the problem of big data. Complex algorithms are often impossible to run at that scale. If you are looking for a job at eBay, you need to know how to work with data at that scale. A goal at the eBay lab is that when a new scientist joins on Monday, they have access to all the data by Friday.

This amount of data has changed how economists run experiments. They can now run experiments on 400 million data points.

What are you optimizing for at eBay? Is it profit maximization? Do you want to increase the shopping cart size? Are you looking for maximum customer satisfaction?

The other thing is how do you measure success?

Everything we do at eBay is a recommendation.

I KEEP six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who.

— from The Elephant’s Child 

When you look at the tag cloud of eBay, you see keywords like “used”, “vintage”, and “antique” a lot more than “new”.

Search is an interesting problem: some people are looking for “ipod nano 4gb black new” and some are looking for a skin for their iPod. Our search engine should be able to differentiate between “ipod nano 4gb black new” and “ipod nano 4gb black new skin”. This poses hard and challenging research questions.

Click trails can help us tremendously with building recommender systems that capture this behavior. At eBay, data cleanup is an important part of building a recommender, especially when click trails are used.

eBay has a language like Pig that allows them to do pattern recognition at scale. Sometimes a search is followed by some page views and another search. This pattern is useful for making recommendations to other users who issue similar initial search queries. See two recent papers from Sundaresan for the results and the model.

Fashion item buyers on eBay are very brand aware. Sometimes eBay does not have enough inventory and needs to recommend suitable products from outside websites.

One of the challenges at eBay is that we do not have a catalogue of items (remember the laser pointer story?). Amazon does not have this problem; you cannot sell anything on Amazon unless it is in the catalog.

eBay uses its own matrix factorization; see their ICML paper. The sparsity in eBay’s data is fascinating: it is 100 times the sparsity of the Netflix data.
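I have not seen their actual algorithm, but for my own reference, a bare-bones SGD matrix factorization (the generic technique, not eBay’s ICML method) on a tiny made-up ratings matrix looks roughly like this:

```python
import random

def factorize(ratings, n_users, n_items, k=2, epochs=2000, lr=0.01, reg=0.02, seed=0):
    """Plain SGD matrix factorization over observed (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 0.7) for _ in range(k)] for _ in range(n_users)]  # user factors
    Q = [[rng.uniform(0.1, 0.7) for _ in range(k)] for _ in range(n_items)]  # item factors
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)  # gradient step with L2 penalty
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

# a tiny, very sparse toy matrix: only 5 of the 9 (user, item) cells are observed
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0), (2, 2, 2.0)]
P, Q = factorize(ratings, n_users=3, n_items=3)
```

The point of the sparsity remark is exactly this: you only ever loop over the observed cells, and the learned factors fill in the rest.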

eBay clusters items into pseudo-products using LDA. He showed an example of a recommendation for a broken BlackBerry cellphone.

The most important thing is “why”: why are you recommending this to the user, and why should they buy it? HCI is a useful tool here: reveal to the user why you are recommending. Something like “52% of the people who bought this item also bought …” is very effective. Be very explicit about why certain recommendations are made.
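Those “52% of the people who bought this item also bought …” numbers are just co-purchase counts. A toy sketch with made-up baskets (my example, not eBay’s pipeline):

```python
from collections import Counter
from itertools import combinations

# made-up purchase baskets: one set of item ids per buyer
baskets = [
    {"ipod", "skin"}, {"ipod", "skin", "charger"}, {"ipod", "charger"},
    {"ipod"}, {"skin"}, {"ipod", "skin"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))  # count co-purchased pairs

def also_bought(item, other):
    """Share of `item` buyers who also bought `other` (the '52%' number)."""
    pair = tuple(sorted((item, other)))
    return pair_counts[pair] / float(item_counts[item])

pct = round(100 * also_bought("ipod", "skin"))  # 3 of the 5 iPod buyers bought a skin
```

The explanation shown to the user is then just a formatted string around that percentage, which is what makes it so cheap and so effective.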

Let’s look at “When”. Things like reminders, post-purchase offers, urgency, upgrades, and seasonal sales fall into this category. Reminders can be like “you have viewed this item”, which reminds people that they can still go and buy it. There is a temporal element to this problem too: a buyer may not need the same item again until 30 days have passed, but will need to buy it again after that.

See this Wired article on persuasion-based profiling and recommendation systems (thanks to Twitter).

There is a lot of seasonality on eBay: Mother’s Day, Father’s Day, Christmas. There are other events that we don’t know about (so my question is: how can we find them algorithmically?).

We get more data from mobile devices than we get from the website. It is a huge research opportunity.