Overview (Details Below)


To register, click RSVP near each event listing (below the Table Overview), or go to the Eventbrite page here.


Date | Day | Venue | Session 1 [5:30 PM unless otherwise noted] | Session 2 [7:30 PM unless otherwise noted]
11/03 | Mon | Thomson Reuters | [6:00 PM] Kickoff Event and First Session: Using Data and Analytics to Make Elephants Dance (Andy Palmer) | [7:45 PM] Second Session: Winning Kaggles and Other Data Science Competitions (Owen Zhang)
11/04 | Tue | hack/reduce | Quantifying Uncertainty: Evaluating Trading Algorithms using Probabilistic Programming: Thomas Wiecki | Vector Space Word Representations: Rani Nelken
11/04 | Tue | DataXu | 15 Petabytes of Data, 1M Requests per second, 5 continents: A look inside DataXu’s Big Data Engine: Sandro Catanzaro and Bill Simmons |
11/04 | Tue | [See below or at EventBrite] | [6:00 PM] You Have the Data, Now What? by Chris Lynch |
11/05 | Wed | hack/reduce | Data Science on a Budget: Maximizing Insight and Impact: Nicholas Arcolano | How to Quantify Culture: Introduction to R Workshop: Ethan Fosse
11/05 | Wed | CIC | In Defense of Imprecision: Why Traditional Approaches to Data Visualization are Changing: Mark Schindler | Visualizing Networks: Lynn Cherny
11/05 | Wed | Microsoft NERD | Building Fast Applications for Streaming Data: Ryan Betts | Big But Personal Data: How Human Behavior Bounds Privacy and What We Can Do About It: Yves-Alexandre de Montjoye
11/06 | Thu | hack/reduce | What is Big Text? A Case Study of the 2014 World Cup: Catherine Havasi | Multi-Armed Bandits and Reinforcement Learning in Computational Advertising: Michael Els
11/06 | Thu | CIC | [5:00 PM] Data Science to the Rescue of Healthcare Costs: Ramesh Kumar (Venture Cafe Networking Event) | [7:00 PM] Neural Networks, Deep Learning, and Financial Modeling: Eric Morris
11/06 | Thu | Thomson Reuters | The Shape of Data: An Intuitive Introduction to the Geometry Behind Machine Learning/Data Learning: Jesse Johnson | Mining Big Data to Solve a Specific Customer Need: Mans Olof-Ors
11/07 | Fri | hack/reduce | Thinking in Data Workshop: David Weisman | Mining Living Organisms: Inferring Biological Models from Wet-Lab Experiment: Daniel Lobo
11/07 | Fri | CIC | [5:00 PM] Doing Data Science with Python and Scikit-Learn: The Mystery of the Sicilian Olive Oils: Rahul Dave |
11/08 | Sat | CIC | [8:30 AM] Workshop - An introduction to Bayesian Statistics using Python: Allen Downey |


Monday November 3
06:00 PM Boston Data Festival 2014 Kickoff @ Thomson Reuters


The Boston Data Festival gets off to a festive start with an evening of networking and talks. We will provide food and drinks at our registration and networking session, where you can mingle with fellow festival speakers, data scientists, data enthusiasts, startups and many others with an interest in data. We are delighted to have two very distinguished speakers, Andy Palmer and Owen Zhang (bios and talk details below), join us as our evening speakers. We will then wrap up with a final networking stretch.

Schedule [RSVP]
6:00 – 7:00 PM Registration and Networking: Cocktails/Appetizers
7:00 – 7:15 PM Introducing the Boston Data Festival: Sheamus McGovern
7:15 – 7:45 PM Using Data and Analytics to Make Elephants Dance: Andy Palmer
7:45 – 8:30 PM Winning Data Science Competitions: Owen Zhang | DataRobot
8:30 – 9:00 PM Networking

07:15 PM Kickoff talk – Using Data and Analytics to Make Elephants Dance (Andy Palmer) @ Thomson Reuters

Big Data 2.0 and Analytics 3.0 create an unprecedented opportunity for large companies to become as nimble and innovative as smaller companies by democratizing analytics and decision-making. However, success requires building an information culture, with an emphasis on bottom-up information-seeking and sharing – plus giving people the power to act on the information to make faster and better decisions. Learn why transforming information culture is never easy – but necessary for those companies that want to facilitate the development of a data-driven enterprise. [RSVP]

07:45 PM Kickoff talk – Winning Data Science Competitions (Owen Zhang) @ Thomson Reuters

Owen Zhang is no stranger to data science competitions. He has competed in and won several high-profile challenges, and is currently ranked 1st out of a community of 200,000 data scientists on Kaggle. This is an opportunity to learn the tips, tricks and techniques Owen employs in building world-class predictive analytic solutions. [RSVP]

Tuesday November 4
05:30 PM Quantifying Uncertainty: Evaluating Trading Algorithms using Probabilistic Programming (Thomas Wiecki) @ hack/reduce

There exist a large number of metrics to evaluate the performance and risk of a trading strategy. Although those metrics have proven to be useful tools in practice, most of them require a large amount of data and yield unstable results on shorter timescales. Quantopian allows users to develop and launch trading algorithms that invest in the stock market. In order to identify stellar quants and connect them with investors, estimating performance with few data points becomes critical. Bayesian modeling is a flexible statistical framework well suited for this problem as uncertainty can be directly quantified in terms of the posterior distribution.

Thomas will briefly provide an overview of Bayesian statistics and how Probabilistic Programming frameworks like PyMC can be used to build and estimate complex statistical models. He will then show how several common financial risk metrics like Alpha and Beta can be expressed as a probabilistic program. Finally, he will apply this type of Bayesian data analysis to evaluate the performance of anonymized real-world trading algorithms running on Quantopian.
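As a rough illustration of how a posterior distribution quantifies uncertainty when only a few data points are available, here is a minimal sketch. It is not the speaker's method and does not use PyMC; it assumes a simple Normal model for daily returns with a known volatility and a Normal prior on the mean return. The function name, the prior settings, and the return figures are all hypothetical.

```python
import math

def posterior_mean_return(returns, sigma, prior_mean=0.0, prior_sd=0.01):
    """Normal-Normal conjugate update for the mean daily return.

    Assumes returns are i.i.d. Normal(mu, sigma^2) with sigma known,
    and a Normal(prior_mean, prior_sd^2) prior on mu.
    Returns (posterior mean, posterior standard deviation).
    """
    n = len(returns)
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / sigma ** 2
    post_precision = prior_precision + data_precision
    post_mean = (prior_mean * prior_precision
                 + sum(returns) / sigma ** 2) / post_precision
    return post_mean, math.sqrt(1.0 / post_precision)

# Hypothetical daily returns of one algorithm (made-up numbers).
few = [0.004, -0.002, 0.006, 0.001, 0.003]
many = few * 10  # the same pattern with ten times as many observations

mean_few, sd_few = posterior_mean_return(few, sigma=0.01)
mean_many, sd_many = posterior_mean_return(many, sigma=0.01)
print(f"5 days:  mean={mean_few:.4f}, sd={sd_few:.4f}")
print(f"50 days: mean={mean_many:.4f}, sd={sd_many:.4f}")
```

The posterior standard deviation is wider for the short track record, which is the sense in which uncertainty is read directly off the posterior.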


05:30 PM 15 Petabytes of Data, 1M Requests per second, 5 continents: A look inside DataXu’s Big Data Engine @ DataXu

The scale of big data in 2014 is bigger than ever. Amounts of data beyond comprehension are being processed every day, and there’s some serious tech going on behind the scenes that powers this big data engine.

Join DataXu co-founders Sandro Catanzaro and Bill Simmons as they peel back the many layers of technology that enable DataXu to respond to more than 1 million requests per second across more than 50 countries on 5 continents. From Amazon Web Services to Hadoop, find out what makes DataXu’s Big Data engine tick during this informative session.

Schedule [RSVP]
5:30 – 6:00 PM Check-in with Food and Drinks
6:00 – 6:45 PM Presentation followed by Question and Answer Period
6:45 – 7:30 PM Networking/Snacks/Drinks

06:00 PM You Have the Data, Now What? (Chris Lynch)

See EventBrite Registration Page for details [RSVP]

07:30 PM Vector Space Word Representations (Rani Nelken) @ hack/reduce

Natural Language Processing (NLP) traditionally mapped words to discrete elements without underlying structure. Recent research replaces these models with vector-based representations, efficiently learned using neural networks. The resulting embeddings not only improve performance on a variety of tasks, but also show surprising algebraic structure.

This talk will provide a general introduction to these exciting developments, giving participants a foundation for understanding NLP. [RSVP]
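To make the "algebraic structure" of word vectors concrete, here is a toy sketch. The vectors are hand-made, hypothetical 3-dimensional values chosen for illustration, not embeddings learned by a neural network, and the helper names are assumptions.

```python
import math

# Hand-made toy vectors (hypothetical values, not learned embeddings).
# The dimensions loosely stand for: royalty, maleness, femaleness.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))
```

With real embeddings the same arithmetic is applied to vectors of a few hundred dimensions learned from large text corpora.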

Wednesday November 5
05:30 PM Data Science on a Budget: Maximizing Insight and Impact (Nicholas Arcolano) @ hack/reduce

Many companies have “big data”, but not every company has the resources (or need) for a big data team. In this talk we will discuss lessons we’ve learned from working as part of a small team within a fast-moving mobile start-up and techniques for getting the most out of your data on a budget. [RSVP]

05:30 PM In Defense of Imprecision: Why Traditional Approaches to Data Visualization are Changing (Mark Schindler) @ Cambridge Innovation Center

In the worlds of research, science, and academia, much attention is given to precision, objectivity, and de-biasing data. The ability to “lie with data” is a legitimate concern. In business analytics, and the burgeoning area of consumer-facing data visualization, though, complete objectivity is not the singular goal.

This talk focuses on situations in business and personal decision-making processes where data is filled with subjectivity, creativity and intuition. [RSVP]

05:30 PM Building Fast Applications for Streaming Data (Ryan Betts) @ Microsoft NERD Center

Data is moving at blinding speeds, generated by a wide range of sources – from mobile phones and sensors to a variety of connected “smart” devices. This data, most valuable the moment it arrives, will continue to increase in both volume and variety. Leveraging data instantly provides the opportunity to make real-time decisions, reduce risks, and sense patterns, delivering the competitive edge to react quickly and correctly.

Stream processing was not designed to serve the needs of modern Fast Data applications. Despite its ability to rapidly ingest data, streaming requires additional code – and a database – to maintain state. This adds application complexity, moving performance bottlenecks to another component in the system. The results are systems that don’t meet the requirements of modern applications.

Employing a solution that handles streaming data, provides state, ensures durability, and supports transactions and real-time decisions is key to benefitting from fast data. During this presentation, participants will learn: (a) the difference between real-time analytics and real-time decisions; (b) how streaming applications deliver more value when built on an in-memory NewSQL database; and (c) why making fast data smart is a significant market opportunity that requires a new database platform designed for the volume, variety and scale of high-speed data. [RSVP]

07:30 PM Big But Personal Data: How Human Behavior Bounds Privacy and What We Can Do About It (Yves-Alexandre de Montjoye) @ Microsoft NERD

We’re living in an age of big data, a time when most of our movements and actions are collected and stored in real time. This data offers unprecedented insights on how we behave as a species.

In this talk, the speaker focuses on anonymous location data to show how a few points – approximate places and times – are enough to identify individuals, even when no “private” information, such as names, e-mails or phone numbers, was collected. The talk will then turn to the case of mobile phone data, showing that phone data is not “just” metadata and that a lot can actually be inferred about an individual by looking at the way he/she uses his/her phone. Finally, the speaker will discuss the impact of metadata on society and some of the legal and technical solutions that are currently being developed at the MIT Media Lab. [RSVP]

07:30 PM How to Quantify Culture: Introduction to R Workshop (Ethan Fosse) @ hack/reduce

This workshop provides an overview of R for those who are unfamiliar with statistics or programming. The first part of the workshop will review the basics of R as an object-oriented statistical programming language, with an emphasis on creating and manipulating objects. The second part will focus on the fundamentals of data analysis, including how to load and manipulate data sets, summarize and visualize variables (using bar graphs, scatter plots, and histograms), and understand relationships among variables through the fitting of statistical models (such as linear regression, analysis of variance, and classification techniques). No knowledge of statistics or programming is required. [RSVP]

07:30 PM Visualizing Networks (Lynn Cherny) @ Cambridge Innovation Center

Network data are increasingly pervasive, but can be hard to work with. The naive first pass at a network diagram usually looks like a “hairball.” However, adding simple network measures like degree, betweenness, centrality, and community membership will enable the creation of a more comprehensive network representation.

This talk focuses on improving the end-user experience of network visuals, covering the ubiquitous force layout as well as alternative layout options and interaction techniques. The examples shown will be primarily from D3.js and Gephi. [RSVP]

Thursday November 6
05:00 PM Data Science to the Rescue of Healthcare Costs (Ramesh Kumar) @ Cambridge Innovation Center

More and more data is being collected and made available in our health care system. A McKinsey report projects that $300bn can be identified and saved in the US healthcare system through better data analytics. But how do we do it? Where is the money in our system? What kind of data is available? What type of analytics is going to disrupt our health care system? What are the opportunities?

Come listen to four leading innovative companies and entrepreneurs share their approaches to using healthcare data to reduce our healthcare costs. [RSVP]

05:30 PM The Shape of Data: An Intuitive Introduction to the Geometry Behind Machine Learning and Data Mining (Jesse Johnson) @ Thomson Reuters

Most experts in data analysis think about data in terms of (relatively simple) geometry. However, in many introductory sources, the geometry is hidden under layers of technical details. The goal of this talk is to put the geometry front-and-center, giving the audience a perspective that will help them to continue exploring data. [RSVP]

05:30 PM What is Big Text? A Case Study of the 2014 World Cup (Catherine Havasi) @ hack/reduce

Twitter has changed. The growth and worldwide success of Twitter has surfaced natural limitations of hashtags and keyword searches. What was once a mechanism for organizing information for efficient consumption is now often rendered obsolete by the overwhelming volume and diversity of discussion. Listening to a large number of posts on a given hashtag becomes unproductive as the conversations spawned within and around these hashtags become indistinct and are drowned out.

For SONY, Luminoso restored the ability to understand, consume, and participate in large-scale social media discussions by automatically eliminating spam, removing duplicates, clustering thematically similar conversations, and surfacing meaningful discussion. Harnessing Luminoso’s technology, SONY launched the world’s first dedicated football social network – One Stadium Live – a network that enabled fans and media to experience the 2014 FIFA World Cup like never before. The One Stadium Live platform provided users with automatically curated topics of discussion, allowing them to participate in and follow conversations they found interesting, while avoiding the ones they didn’t – across 6 different languages. [RSVP]

07:00 PM Neural Networks, Deep Learning, and Financial Modeling (Eric Morris) @ Cambridge Innovation Center

The talk will focus on the history, structure, and general applicability of artificial neural networks and deep learning with attention to use in financial modeling problems.

The recent resurgence of artificial neural networks for machine learning problems in both academia and industry has been led by tremendous research and development by Geoffrey Hinton at University of Toronto, Yann LeCun at NYU, Yoshua Bengio at University of Montreal, and Andrew Ng at Stanford. This powerful and versatile machine learning tool can be applied to a wide range of tasks, especially in big data problems, where making a priori assumptions is difficult (or impossible) and there exist significant nonlinearities. Using artificial neural networks for machine learning tasks has improved state-of-the-art performance on benchmark datasets. How will you apply artificial neural networks to create value in your business area and define a new peak in performance? [RSVP]

07:30 PM Multi-Armed Bandits and Reinforcement Learning in Computational Advertising (Michael Els) @ hack/reduce

This talk will cover the most common learning strategies to solve the multi-armed bandit problem. It will involve a Python simulation environment to illustrate how the system changes under different assumptions and how prior learning can influence and seed the system. The talk will also discuss the perspectives of the computational advertising framework at MaxPoint, where these types of algorithmic learning strategies are employed to enable optimal ad serving behavior. [RSVP]
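For readers unfamiliar with the multi-armed bandit problem, the following is a minimal sketch of one common strategy, epsilon-greedy, in a small Python simulation. It is not the speaker's simulation environment or MaxPoint's system; the function name, the click-through rates, and the optional `initial_estimates` seeding parameter are hypothetical choices made for illustration.

```python
import random

def epsilon_greedy(click_rates, steps=5000, epsilon=0.1,
                   initial_estimates=None, seed=0):
    """Simulate epsilon-greedy ad selection over hypothetical ads.

    click_rates: true (hidden) click probability of each ad.
    initial_estimates: optional prior guesses used to seed the learner.
    Returns (pull counts per ad, estimated click rate per ad).
    """
    rng = random.Random(seed)
    n_arms = len(click_rates)
    pulls = [0] * n_arms
    estimates = list(initial_estimates) if initial_estimates else [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random ad
        else:
            arm = max(range(n_arms), key=lambda i: estimates[i])  # exploit
        reward = 1.0 if rng.random() < click_rates[arm] else 0.0
        pulls[arm] += 1
        # Incremental running-average update of the estimated click rate.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return pulls, estimates

true_rates = [0.05, 0.10, 0.30]  # made-up click-through rates
pulls, estimates = epsilon_greedy(true_rates)
print("pulls per ad:", pulls)
print("estimated rates:", [round(e, 3) for e in estimates])
```

Passing `initial_estimates` lets earlier knowledge steer the first greedy choices, which is one simple way to picture how prior learning can seed such a system.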

07:30 PM Mining Big Data to Solve a Specific Customer Need (Mans Olof-Ors) @ Thomson Reuters

This talk shows how to use a hackathon as a catalyst to bring together content, technologies and expertise from across a company to come up with an innovative solution for establishing entity similarities (comparable companies). [RSVP]

Friday November 7
05:00 PM Doing Data Science with Python and Scikit-learn: The Mystery of the Sicilian Olive Oils (Rahul Dave) @ CIC

The talk will cover an example of the entire data science process, from cleaning the data to exploring it, and from visualizing the data to analyzing it. It will then delve into the machine-learning process to classify the data, with multiple techniques, and gain an understanding of the tradeoffs involved in making predictions. Finally, participants will learn methods to ensure the predictions are robust. [RSVP]

05:30 PM Thinking in Data Workshop (David Weisman) @ hack/reduce

This hour-long beginner-level workshop (no laptop needed) focuses on critical thinking about data. The speaker will focus on examples of sampling problems, biases, outliers, confounding, and spurious correlations, and show how these lead to wrong conclusions. The workshop will also show how exploratory data analysis through visualization can bring clarity. Participants will take away a greater awareness of data itself and be able to apply these ideas to their data science projects. [RSVP]

07:30 PM Mining Living Organisms: Inferring Biological Models from Wet-Lab Experiments (Daniel Lobo) @ hack/reduce

Many living organisms have an extraordinary capacity to self-generate and self-repair complex patterns and shapes. To elucidate the mechanisms driving these poorly understood processes, biologists are producing an extraordinarily complex dataset of surgical and genetic experiments. In this talk, the authors present their approach based on formal ontologies, evolutionary computation, and in silico simulators to automate the discovery of biological models from wet-lab experiments, which they hope will pave the way to revolutionary biomedical applications. [RSVP]

Saturday November 8
08:30 AM Workshop – An introduction to Bayesian Statistics using Python (Allen Downey) @ CIC

This is an introduction to Bayesian statistics using Python. Bayesian statistical methods are usually presented mathematically, but many of the ideas are easier to understand computationally. People who know some Python have a head start. Note: We will use material from Think Bayes, published by O’Reilly Media and available free at

Bayesian statistical methods are becoming more common and more important, but there are not many resources to help beginners get started. People who know Python can use their programming skills to get a head start. I will present simple programs that demonstrate the concepts of Bayesian statistics, and apply them to a range of example problems. Participants will work hands-on with example code and practice on example problems.

Participants should have at least basic Python and basic statistics. If you learned about Bayes’s theorem and probability distributions at some time, that’s enough, even if you don’t remember it! Participants should bring a laptop with Python and matplotlib installed. You can work in any environment; you just need to be able to download a Python program and run it. I will provide code ahead of time to help with set up. [RSVP]

Highlights from the 2014 Boston Data Festival
01:00 Talks

We had talks on topics ranging from “Quantifying Uncertainty: Evaluating Trading Algorithms using Probabilistic Programming” by Thomas Wiecki to “Big But Personal Data: How Human Behavior Bounds Privacy and What We Can Do About It” by Yves-Alexandre de Montjoye.


Click here for the full list of our 2014 Event speakers.


02:00 Workshops

We had well-attended workshops on Python Scikit-learn as well as practical Bayesian Analysis.

