15 Principles for Data Scientists

•June 2, 2013 • 9 Comments

I have developed 15 principles for my daily work as a data scientist. These are the principles that I personally follow:

1- Do not lie with data and do not bullshit: Be honest and frank about empirical evidence. And most importantly, do not lie to yourself with data.

2- Build everlasting tools and share them with others: Spend a portion of your daily work building tools that make someone’s life easier. We are freaking humans, we are supposed to be tool builders!

3- Educate yourself continuously: you are a scientist, for Buddha’s sake. Read hardcore math and stats from graduate-level textbooks. Never settle for the shitty explanation of a method that you get from a coworker in the hallway. Learn fundamentals and you can do magic. Read recent papers, go to conferences, publish, and review papers. There is no shortcut for this.

4- Sharpen your skills: learn one language well enough to be called a pro. Learn other languages well enough to be able to communicate with others. Don’t forget, SQL is like English: it is spoken by every moron on this planet, but if you master it you can make beautiful poetry. Learn a compiled language, an interpreted language, and R. Or just learn R! It is ugly but it will give you an edge. And fuck Matlab, you are not an undergrad anymore. Learn Unix, even if you use Windows; learn sed and grep and all that. You can do wonders with bash and PowerShell. If you want, learn how to use Hadoop too, but know that it is a crappy system.

5- Know that a data scientist has one purpose in life: “Kick ass and amaze people.” Do one thing every day related to this.

6- Challenge yourself often by presenting your work to others. Do not be scared of a few douchebags who might criticize your work. Crush them. If you let cockroaches scare you, you would never walk at all!

7- Be generous with your knowledge and don’t be afraid to ask questions: some people are insecure about their knowledge and do not share it; forgive them, but do not be one of them.

8- Develop some ideas first and then listen to other people’s insights; utilize what they know about the domain but do not restrict yourself to that: if they could solve the problem with what they knew, they wouldn’t have come to you for a solution.

9- Hang out with people, talk to them, learn how you can be useful in their projects and how their work can benefit your projects

10- Build impressive and interactive user interfaces for your bland codes: Code is our language, let your code shine with a UI.

11- Use visualization efficiently, avoid hard-to-understand graphs: The only purpose of visualization is to make data understandable, not confusing.

12- Learn about new technologies and strive to understand the fundamentals of classic technologies

13- Over promise and over deliver: this is how genius people work. Do not be scared of proposing creative ideas. Have you heard of “under promise and over deliver”? That’s how shitty cubicle rats work. Don’t be one of them.

14- Stay Creative and Focused: you can win with creativity and focus (caffeine can help here but do not overdo it)

15- Be positive, work hard and if anyone wants to stop you just crush them

Learning C++11

•June 22, 2012 • Leave a Comment

(I am still updating this post, I am learning C++11 and this is my live blog post. There might be typos and bugs)

I have been hearing about the modern C++ and I feel that it is something that has a future. I mean, C++ is already strong, it has survived for 30 years, but when I code in C++ I need to spend my brain cycles on stupid things. I need to think carefully about data structures, but when I am working in something like Python I am free to get creative; I do not need to care about low-level stuff. I can just code and hope that my code is going to run relatively fast.

The new C++ seems to be pretty amazing. I am not just talking about the “auto” keyword and type inference. The lambda functions seem to be very useful, and C++ now has things that were available in Java and C# from the beginning (remember for_each?). What I want to say is that C++ now feels like a modern language.

In this thread I will collect documents that can teach you and myself about the new C++. I will be a little careless about copyrights but nothing on this page is mine and I have just compiled it.

1- Watch the video, “Not your father’s C++” by our man Herb Sutter

2- Read Herb Sutter’s blog post “Elements of Modern C++ Style”

3- Lambda Expressions

An example of lambda functions is on this page 

#include <iostream>

using namespace std;

int main() {
    auto func = [] () { cout << "Hello world"; };
    func(); // now call the function
    return 0;
}

I use Visual Studio 2010 and lambda functions already work in it. You may want to add #include “StdAfx.h” at the top of your source for your code to work.

New Features

The new features are summarized here

This is another good example of lambda expressions: “Using Lambda Expressions for Shorter, More Readable C++ Code”

The Wikipedia article for C++11 is relatively useful; the only problem is that it does not highlight which features are already implemented in VS 2010 or 2011. For example, constant expressions are not yet supported in Visual Studio (at least not in the 2010 version that I use).

TODO: add this on Modern C++ http://msdn.microsoft.com/en-us/library/hh279654(v=vs.110).aspx




My notes from the “Learning to Learn” talk by Stanford’s Benjamin Van Roy

•April 17, 2012 • Leave a Comment
Benjamin Van Roy

Below are my notes from a talk entitled “Learning to Learn” by Benjamin Van Roy. I am reading some of the references and will add more to this document to make it readable for others soon.

Abstract: I will discuss the importance of learning to learn, and how this is a distinctive element of reinforcement learning relative to other areas of statistical learning. I will then survey some relevant research and discuss recent work with Zheng Wen on an algorithm that efficiently learns to learn (and learns) in dynamic systems with arbitrarily large state spaces by combining optimistic exploration and value function generalization.


Bio: Benjamin Van Roy is broadly interested in the formulation and analysis of mathematical models that address problems in information technology, business, and public policy. He is a Professor of Management Science and Engineering and Electrical Engineering, and, by courtesy, Computer Science, at Stanford University. He has held visiting positions as the Wolfgang and Helga Gaul Visiting Professor at the University of Karlsruhe and as the Chin Sophonpanich Foundation Professor of Banking and Finance at Chulalongkorn University. He has served on the editorial boards of Discrete Event Dynamic Systems, Machine Learning, Mathematics of Operations Research, and Operations Research, for which he is currently the Financial Engineering Area Editor. He has served as a researcher, advisor, founder, or director, for several technology companies. He received the SB (1993) in Computer Science and Engineering and the SM (1995) and PhD (1998) in Electrical Engineering and Computer Science, all from the Massachusetts Institute of Technology.



Reinforcement Learning Models in Literature

  • Myopic Learning
  • Dithering??
  • Reinforcement Learning

What is this “multi-armed bandit” I keep hearing about everywhere there is an online-ad talk? I should learn it. Watch this video lecture later.
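For future me: the core idea fits in a few lines. This is a toy epsilon-greedy bandit sketch, not anything from the talk; the arms, click rates, and parameters are all made up for illustration.

```python
import random

def epsilon_greedy(true_rates, steps=10000, eps=0.1, seed=0):
    """Toy epsilon-greedy bandit: each arm is an ad with an unknown
    click-through rate; we balance exploring random arms with
    exploiting the best estimate so far."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    counts = [0] * n_arms    # pulls per arm
    values = [0.0] * n_arms  # estimated click rate per arm
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # running mean
    return counts, values

# three hypothetical ads with 2%, 5% and 10% click rates;
# the bandit should end up pulling the 10% arm most of the time
counts, values = epsilon_greedy([0.02, 0.05, 0.10])
```

The regret-bound algorithms in the reading list below (Kearns-Singh, Brafman-Tennenholtz) are much smarter than this, but this is the baseline they improve on.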

Literature on efficient reinforcement learning:

  1. Kearns-Singh 2002
    1. Devise plan to learn soon if possible
    2. Otherwise plan to exploit
  2. Brafman-Tennenholtz 2002
    1. Optimistic exploration
  3. Kearns-Koller 1999
  4. Abbasi-Yadkori-Szepesvári 2011


A Mandelbrot Fractal in Python

•April 12, 2012 • Leave a Comment

I coded up this Mandelbrot fractal in Python while watching TV. Not sure if it is helpful for anybody, but you may want to take a look at it and enjoy the beauty of chaotic dynamical systems. The code is posted here and below too. Here is a fascinating high-quality version of it.

# Mandelbrot set
# By Mark Alen
# linux_jvm@yahoo.com
# April 2012
from math import log
from PIL import Image, ImageDraw

white = (255, 255, 255)
width = 5000
height = width
image1 = Image.new("RGB", (width, height), white)
draw = ImageDraw.Draw(image1)

# http://en.wikipedia.org/wiki/Mandelbrot_set
for xpix in range(width):
    for ypix in range(height):
        x0 = (xpix * 1.0 / width * 3.5) - 2.5
        y0 = (ypix * 1.0 / height * 2) - 1
        x = 0.0
        y = 0.0
        iteration = 0
        max_iteration = 1000
        # iterate z -> z^2 + c until escape or the iteration cap
        while (x * x + y * y) < 4 and iteration < max_iteration:
            xtemp = x * x - y * y + x0
            y = 2 * x * y + y0
            x = xtemp
            iteration = iteration + 1
        # log-scaled grayscale; points that never escape stay white
        mycol = int(255.0 * (1 - log(iteration / 1000.0 * 255 + 1) / log(256)))
        color = (255 - mycol, 255 - mycol, 255 - mycol)
        if iteration == max_iteration:
            color = white
        draw.point((xpix, ypix), color)
    if xpix % (width / 100) == 0:
        print xpix * 100.0 / width, "%"
filename = "mandel.png"
image1.save(filename)  # the original never saved the image
print "Done!"

Dear Coursera and Udacity! Don’t congratulate yourself too much

•March 27, 2012 • 21 Comments

So a couple of smart professors from Stanford have started two separate startups, put their courses on the web, and the world is going nuts. Everyone is talking about them and they are busy congratulating themselves on this amazing accomplishment. Every major paper is writing about how these professors are revolutionizing education and how amazing these websites are (see the Wired or NY Times articles). Do not get me wrong, the work that Udacity and Coursera are doing is far superior to MIT’s course dump (OCW), but is it really what we were all envisioning for online education? I mean, come on! We did all that research on distance learning, collaborative whiteboards, online labs, and we ended up with these low-quality Khan Academy-style videos? Are you kidding me? We have a whole freaking academic community specifically around engineering education; they even publish scientific journals!

I have a lot of respect for the professor who started this, Sebastian Thrun, whose wonderful book on probabilistic robotics was my bible for a long time, but here is what I think they are doing wrong.

Both classrooms (Udacity and Coursera) are too similar to regular classrooms.

Just like a classroom, the course starts on specific dates and goes on for 7 weeks. Students need to stick to deadlines, do quizzes, submit homework, and finish on time. There is no flexibility, there is no customization; you will take the same course as the next guy over the internet with a completely different background. What if I want to learn a topic in a year instead of 7 weeks? What if I want to learn it in 10 years? For example, I was busy last week and was catching up on my emails today. One of the emails was from Coursera announcing that their algorithms course went live last week; when I went to sign up today it told me that I cannot enroll now! My question is: why? Seriously, why can’t I start whenever I want and finish whenever I want? This is the same thing that I hated about my old-fashioned offline university!

In fact, Professor Thrun has published his vision for online education as a university that has the following elements:

…Nine essential components of a university education: admissions, lectures, peer interaction, professor interaction, problem-solving, assignments, exams, deadlines, and certification.

Are you kidding me? I know a system that was around way before the web and had the same elements; its name is “College”. So all you have done is take the same lectures and make videos out of them, and that has become the revolution in education that we were all dreaming about?

This is what I think: People are taking these courses because for many it is the only way to learn about interesting topics like robotics or machine learning. Take a video of a Stanford professor talking about a hot topic and people will eat that up. That does not necessarily mean that we have unlocked the power of online education. I also doubt it will give any value to a Stanford student who can sit in the real classroom.

To me this is aiming low, it is giving up on our dreams,  it really is a failure.

5 Reasons Why We Live In A Freaking Exciting Time

•February 18, 2012 • 2 Comments

Our friend and comedy extraordinaire Louis C.K. has a short clip called “Everything is amazing and nobody is happy”. In the clip he basically asks why we don’t get excited about the simple things that technology has brought us. Watch it below.

I keep telling my girlfriend that we live in an exciting and extraordinary time. And she keeps telling me that my dad was probably saying the same thing 30 years ago. The thing she doesn’t know is that my dad was just a kid on the streets of a poor third-world country, struggling to finish his PhD without Wikipedia and the Internet (even though Al Gore had invented the Internet a couple of years earlier 🙂). But now we have all these exciting things at our fingertips. With Wikipedia I am a hundred times smarter than my dad at the smartest point in his life.

Below, I’ll give you 5 reasons from my everyday life, and hopefully I can convince you that we should all go “Oh my God, this life is fucking awesome”. I will just give you examples about education, as we rarely get excited about anything else.

Reason 1. TV is changing and educational videos are becoming cooler to watch: We have a big TV at home, but we do not have cable. It is hooked to a small computer with which we can watch all sorts of shit, from the education channels on YouTube to the TED videos that do not suck major buttocks. By the way, have you watched “Justice with Michael Sandel”? Highly recommended.

Reason 2. We can now take our education on the road with us: Even I, as a poor grad student can now afford to have a couple of wireless devices. I can load PDF files onto my tablet and read them on the road. Through our university we get full access to books from O’Reilly and Springer. I can read them on the train to work without killing a lot of trees.

John Canny

John Canny is a professor at Berkeley. He is mainly known for the fact that he unlocked the secret to days longer than 24 hours. When he was at MIT he used only 24 hours of his 70-hour days to invent the Canny edge detector and used the rest to date women. After he got married and settled down, he started utilizing the rest of his days to invent things in HCI, machine learning, and God knows how many unrelated fields like healthcare and psychology. To this date I am still wondering how he can make all these contributions. Many scientists go into the severest depression as soon as they realize that they can never be John Canny.

3. High-quality education is becoming accessible to every fucking idiot: This is the most exciting thing for me. I have probably paid thousands of dollars to UC Berkeley. And I am kind of happy about it. That gives me the privilege of sitting in classes taught by professors like John Canny and Michael Jordan. But I can also stay home and get the same education from the Internet. I can watch Andrew Ng’s machine learning class or Berkeley’s scalable machine learning course without paying a penny (well, I pay 40 bucks to those bloodsuckers at AT&T for the Internet, but that’s another story). Also, if you like these things, I highly recommend following John Canny’s Behavioral Data Mining course.

4. Science is actually being used now: When you read something about technology, you know that it is being used right now. I was reading a paper about the All-Reduce method and it was great to know that my homeboy, John Langford, has used it in his Vowpal Wabbit and Yahoo is using it for spam filtering.

5. Science experiments are becoming inexpensive: I have an awfully incapable laptop, but for a very cheap price I can now get a large cluster crunching numbers for me. My friends at Udacity are now teaching high school kids how to build a kick-ass search engine using commodity computers.

I do not know about you, but every day when I wake up I feel freaking blessed that I live in this time. As Salman Khan says, if these things do not make you excited, you might have no soul 🙂

Disclaimer: I’m indebted to Professor Canny immensely. And I have a lot of respect for the man. This is just a joke.

Notes from Neel Sundaresan’s keynote speech at RecSys 2011

•October 26, 2011 • Leave a Comment
Neel Sundaresan

Neel Sundaresan

He started by stating that he won’t have any Greek symbols in the talk.

Arch West was the inventor of Doritos and David Pace was the inventor of Pace sauce. What they did was notice that they could sell more if they advertised the two products together. There is a lesson about cross-selling and recommender systems that we can learn from this story.

eBay started when Pierre Omidyar wanted to sell his broken laser pointer. He listed the laser pointer online for 99 cents and it finally sold for 14 dollars. He was wondering if the person who bought it knew that it was broken. The guy responded, “Yes, I am a collector of broken laser pointers.”

Why do people buy something? It is hard to say. Some people buy stories! Remember the toast that sold for $27K? That was the cheapest marketing campaign a casino ever had. One man’s trash is another’s treasure.

The long tail in eBay’s context means most people sell very few items, and most of eBay’s revenue comes from these people (i.e. the mean is way larger than the median).
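That mean-versus-median gap is easy to see with synthetic data. A quick sketch; the Pareto distribution and its shape parameter here are my stand-ins, not eBay’s actual numbers:

```python
import random
import statistics

rng = random.Random(42)
# hypothetical seller revenues from a heavy-tailed distribution:
# most sellers move a few items, a few power sellers move a lot
revenues = [rng.paretovariate(1.2) for _ in range(100000)]

mean = statistics.mean(revenues)
median = statistics.median(revenues)
# the long tail drags the mean well above the median
```

With a tail this heavy the sample mean comes out several times the median, which is the “mean way larger than median” shape he described.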

Users are constantly running experiments to maximize their revenue. They are constantly testing whether free shipping can sell more; different selling strategies are being tested by users at any time on eBay.

This causes interesting behavior. If you promote a user’s product on the homepage, they may increase their price! This is an interesting dynamic between the seller and the platform (eBay).

One of the problems that companies like eBay have is the problem of big data. Complex algorithms are often impossible to work with at that scale. If you are looking for a job at eBay you need to know how to work with data at that scale. A goal at the eBay lab is that when a new scientist joins on Monday, they have access to all the data by Friday.

This amount of data has changed how economists do experiments. They can now run experiments on 400 million data points.

What are you optimizing for at eBay? Is it profit maximization? Do you want to increase the shopping cart size? Are you looking for maximum customer satisfaction?

The other thing is how do you measure success?

Everything we do at eBay is a recommendation.

I KEEP six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who.

— from The Elephant’s Child 

When we look at the tag cloud of eBay, we see keywords like “used”, “vintage” and “antique” a lot more than “new”.

Search is an interesting problem: some people are looking for “ipod nano 4gb black new” and some are looking for a skin for their iPod. Our search engine should be able to differentiate between “ipod nano 4gb black new” and “ipod nano 4gb black new skin”. This poses hard and challenging research questions.

Click trails can help us tremendously with building recommender systems that capture this behavior. At eBay, data cleanup is an important part of a recommender, especially when they use click trails.

eBay has a language like Pig that allows them to do pattern recognition at scale. Sometimes a search is followed by some page views and another search. This pattern is useful for making recommendations to other users who have similar initial search queries. See two recent papers from Sundaresan for the results and model.

Fashion-item buyers on eBay are very brand-aware. Sometimes eBay does not have enough inventory and needs to recommend proper products from outside websites.

One of the challenges at eBay is that we do not have a catalog of items (remember the laser pointer story?). Amazon does not have such a problem; you cannot sell anything on Amazon unless it is in the catalog.

eBay uses its own matrix factorization; see their ICML paper. The sparsity in eBay’s data is fascinating: it is 100 times the sparsity of the Netflix data.
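I don’t have their model, but the generic technique is easy to sketch: learn a low-rank factorization of the sparse user-item rating matrix with SGD. Everything below (the toy ratings, rank k, learning rate, regularization) is made up for illustration; their ICML paper has the real thing.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02,
              epochs=500, seed=0):
    """SGD matrix factorization on (user, item, rating) triples:
    learn latent vectors U[u], V[i] so that dot(U[u], V[i]) ~ rating."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(U[u][f] * V[i][f] for f in range(k))
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)  # gradient step
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V

# toy data: (user, item, rating) triples -- most cells missing, i.e. sparse
ratings = [(0, 0, 5), (0, 1, 1), (1, 0, 4), (2, 1, 2)]
U, V = factorize(ratings, n_users=3, n_items=2)
predicted = sum(U[0][f] * V[0][f] for f in range(2))  # should land near 5
```

The 100x-Netflix sparsity he mentions is exactly why this family of models is attractive: the latent vectors let you generalize from the few cells you do observe.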

eBay clusters items into pseudo-products using LDA. He showed an example of a recommendation for a broken BlackBerry cellphone.

The most important thing is “why”: why are you recommending this to the user, and why should they buy it? HCI is a useful tool here; reveal to the user why you are recommending. Something like “52% of the people who bought this item also bought …” is very effective. Be very explicit about why certain recommendations are made.
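Those “X% of the people who bought this also bought that” explanations are just co-purchase counting. A toy sketch with made-up purchase histories (not eBay data or code):

```python
# hypothetical purchase histories: user -> set of items bought
purchases = {
    "u1": {"phone", "case"},
    "u2": {"phone", "case", "charger"},
    "u3": {"phone", "charger"},
    "u4": {"case"},
}

def also_bought_pct(item, other):
    """Of the users who bought `item`, what share also bought `other`?"""
    buyers = [u for u, items in purchases.items() if item in items]
    both = [u for u in buyers if other in purchases[u]]
    return 100.0 * len(both) / len(buyers)

pct = also_bought_pct("phone", "charger")
# 2 of the 3 phone buyers also bought a charger, so pct is about 66.7
```

The hard part at eBay scale is not this arithmetic, it is computing the co-occurrence counts over billions of transactions.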

Let’s look at “when”. Things like reminders, post-purchases, urgency, upgrades, and seasonal sales fall into this. Reminders can be like “you have viewed this item”, which reminds people that they can still go and buy. There is a temporal element to this problem too: a user may not need the same item again until 30 days have passed, but will need to buy it again after that.

See this Wired article on persuasion-based profiling and recommendation systems (thanks to Twitter).

There is a lot of seasonality on eBay: Mother’s Day, Father’s Day, Christmas. There are other events that we don’t know about (so my question is: how can we find them algorithmically?).

We get more data from mobile devices than we get from online. It is a huge research opportunity.

Notes from From Understanding to Enabling Networks: Using Web Science to Enhance Recommender Systems

•October 24, 2011 • Leave a Comment
Noshir Contractor

Noshir Contractor

The keynote at #recsys2011 is by Noshir Contractor. He is the coauthor of “Theories of Communication Networks”, which seems to be an interesting book judging from its Amazon reviews.

The presentation deck is available here (thanks to @barrysmyth for the link).

He started by presenting SNIF. SNIF is a device and social network for dogs! Kind of social petworking. In contrast, Lovegety is the SNIF technology for people: find love through random encounters.

Today we will talk about how we can take research in the social sciences and bring it to recommender systems.

People have looked at citations and papers and found that people who write papers in teams have a high impact. Also, articles by teams from different disciplines and different geographic locations have the highest impact. Finding the appropriate team from a diverse background and geography is much harder.

Thus we are looking at assembling these types of teams. But how do we decide whom to bring onto the team?

The exciting thing about our time is that we have theories, data, and methods; additionally, we have the computational infrastructure to run these models.

Why do people collaborate with each other?

MTML model:

  • self interest (from econ theories)
  • Social and resource exchange
  • Mutual interest and collective action
  • Theories of contagion
  • Theories of balance
  • Theories of homophily
  • Theories of proximity
My note: How about Robert Sapolsky’s theory?

Exponential random graphs can explain how these collaboration networks are formed (the shape of the graph).

They have looked at the structure of NSF proposals, and they wanted to see if they could build a recommender system that uses characteristics of the proposal to make recommendations for acceptance.

The likelihood of collaboration is higher if:

  • you have written an NSF proposal together
  • you have cited each other

Didn’t know about the h-index. Interesting factor. Apparently those with a higher h-index are less likely to collaborate.

Citing your collaborators actually reduces the likelihood of getting NSF funding (!)
Solving the link recommendation problem (recommending who should be on the team):

Link prediction approaches: node-wise similarity, network topology, or probabilistic modeling.

p* for link prediction: use p* models to calculate link probability

  • Estimate p*/ERGM
  • the rest I didn’t get to type (!)

I think the probabilistic model he is referring to is the same as model fitting on Bayesian nets, but I am not sure.
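I didn’t catch the p*/ERGM details, but the first of the three approaches, node-wise similarity, is easy to sketch. A toy co-authorship graph with invented names; score every unconnected pair by how many collaborators they share:

```python
from itertools import combinations

# toy co-authorship graph: who has written with whom (names made up)
graph = {
    "ana":   {"bob", "carol"},
    "bob":   {"ana", "carol", "dave"},
    "carol": {"ana", "bob", "dave"},
    "dave":  {"bob", "carol", "erin"},
    "erin":  {"dave"},
}

def common_neighbors(u, v):
    """Node-wise similarity: the more collaborators two people
    share, the more likely a future link between them."""
    return len(graph[u] & graph[v])

# rank all currently unconnected pairs as link recommendations
candidates = sorted(
    ((common_neighbors(u, v), u, v)
     for u, v in combinations(sorted(graph), 2)
     if v not in graph[u]),
    reverse=True,
)
# top recommendation here: ana and dave, who share two collaborators
```

The ERGM/p* approach in the talk models link probabilities over the whole graph instead of scoring pairs independently; this is just the simplest baseline.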
The talk ended with a demo of the implementation that is available here.

Noshir’s book is also available for free on his personal website.


Notes from “Recommendations as a Conversation with the User” by Daniel Tunkelang

•October 24, 2011 • Leave a Comment
Daniel Tunkelang

Daniel Tunkelang

These are my unedited notes from Daniel Tunkelang’s presentation at #recsys2011. I am editing as you are reading this post.

“Recommendations as a Conversation with the User” by Daniel Tunkelang
Goal is to have a better relationship with the user

Three take aways from this talk:

  • Consider asking vs guessing
  • Ask good questions
  • It’s okay to make mistakes if you have a good explanation and adapt to feedback


“The Man Who Lied to His Laptop” is a great related read.
Paul Grice’s maxims of conversation:

  1. Quality
  2. Quantity
  3. Relation
  4. Manner

Maxim 1: Quality. Do not lie

  • Don’t use “recommended” when you really mean “sponsored” or “excess inventory”. The user’s loss of trust will cost you, but users do not have a model of how to trust a system
  • Optimize for the user’s utility
  • Apply a standard of evidence (quality, quantity) that you believe in

Maxim 2: Quantity. The right amount of information

  • Exchange small units of information
  • If recommendations supplement other content consider overall cognitive load
  • provide short meaningful explanations

Maxim 3: Relation. Relevant to the user

  • Offer value to the user
  • respect task context
  • don’t be obnoxious

Maxim 4: Manner

  • relevant to the user
  • Eschew obfuscation
  • Avoid ambiguity
  • be brief
  • be orderly

Another perspective

Another perspective is Gary Marchionini’s perspective on Human computer information retrieval

Empower people to explore large-scale information, but demand that people also take responsibility for the control by expending cognitive and physical energy

Principles of HCIR

  1. do more than deliver information: facilitate sense-making
  2. require and reward effort
  3. adapt to increasingly knowledgeable users over time
  4. be engaging and fun to use

Adapt to user knowledge
Systems that don’t get better over time will frustrate users, because users DO get better over time

Personalized recommendations

  • be transparent about model so users gain insight
  • allow users to modify models to correct
  • solicit just enough information to provide value
  • Exemplars are interesting tools to communicate the recommender model to the user
  • Users should be able to modify the recommender system: say the system uses location and the user is on a proxy; they should be able to turn it off to make it non-creepy!

Social recommendations

  • identify the right set of similar users
  • allow users to manipulate the social lens
  • accommodate users who break your model

When making item recs, explain your recommendations! Watch for non-sequiturs (diapers -> beer problem)

“Tell me about yourself” is friendlier than “fill out a 20-page survey”

“Corpse Bride” is in the recommended set and I have watched it; it is good. It gives me the feeling that the recommender is working properly.

Learning from Netflix

  • Ask users for help upfront but not too much help
  • pay attention to what the user tells you
  • give users value often and early

75% of Netflix views result from recommendations

Under-promising and over-delivering is sometimes a good idea

Some models are more explainable than others

  1. consider decision trees and rule-based models
  2. avoid using latent, unlabeled features
  3. if the model is opaque, use examples as surrogates

Make a good first impression
your user’s first experience is critical

See “Machine Learning for Large Scale Recommender Systems” by Agarwal and Chen, ICML 2011 tutorial

“We Will All Be Jedi Masters Soon” or “Random But Coherent Thoughts on Modern Education”

•October 12, 2011 • 2 Comments

Four things have happened recently:

1- We are living in an exciting era. Stanford is offering their AI course online for free and I am telling you, it is not the crappy study material that MIT dumps on their OpenCourseWare website. These are serious, well-curated videos with quizzes and assignments. Basically what it means is that a brown kid in the deserts of southern Oman can now learn what a wealthy, full-of-himself Stanford kid in Palo Alto learns about AI.

2- For my PhD I worked at the intersection of human-computer interaction (HCI) and machine learning (ML). It took me a couple of years, and I can assure you, you can find free educational material online to become as bad-ass as I am in both fields. Ironically, the educational videos on machine learning far outnumber the videos on HCI!

3- Steve Jobs died a couple of days ago. As an open-source contributor I hated him while he was around. But I felt extremely sad when he passed away. Let’s all face it: he might have been an ass when it comes to treating others, or exploiting and abusing child labor. But he led many amazing projects. Apple was a mecca for HCI people. They really set the standard for innovation in consumer products. I recently found this video by Steve Jobs in which he emphasizes tool building and how computers are making us superhumans. And it is very true. The guy is a visionary. To borrow the words: we can quote him, disagree with him, glorify or vilify him, but the only thing we can’t do is ignore him, because he changed our lives forever.

4- In terms of computing, we are living in an amazing time too. I work in a lab that has a pretty strong cluster of computers, and 90% of the time the load on the cluster is not that much. Do you see it? It means we are entering an era in which our computing power exceeds our computing needs! And if we utilize our computing power well, we may actually have excess cycles. Things like Hadoop have allowed us to treat multiple computers like a single computer and run massive jobs on them, doing things that were impossible before.

I guess what I am saying is that:

The other blog post is claiming that we need to be awesome, otherwise we will lose our jobs. What I am saying is that the cost of becoming awesome is decreasing dramatically. With all these free courses, education is becoming cheap (while schooling becomes more and more expensive), our tools are getting better, and I also believe that seeing our friends on facebook/twitter/google+ has given us an incentive for self-improvement and encourages us to learn and educate ourselves more. It has become much easier to push ourselves to become a Jedi master.