2014 Boston Data Festival
November 3-8, 2014

Boston's second annual Data Festival brings together the meetup community, entrepreneurs, VCs and others to highlight its data-centric scene.

Program Schedule
Monday November 3
06:00 PM Boston Data Festival 2014 Kickoff @ Thomson Reuters


The Boston Data Festival gets off to a festive start with an evening of networking and talks. We will provide food and drinks at our registration and networking session, where you can mingle with fellow festival speakers, data scientists, data enthusiasts, startups and many others with an interest in data. We are delighted to have two very distinguished speakers, Andy Palmer and Owen Zhang (bio and talk details below), join us as our evening speakers. We will then wrap up with a final networking stretch.

Schedule [RSVP]
6:00 – 7:00 PM Registration and Networking: Cocktails/Appetizers
7:00 – 7:15 PM Introducing the Boston Data Festival: Sheamus McGovern
7:15 – 7:45 PM Using Data and Analytics to Make Elephants Dance: Andy Palmer
7:45 – 8:30 PM Winning Data Science Competitions: Owen Zhang | DataRobot
8:30 – 9:00 PM Networking

07:15 PM Kickoff talk – Using Data and Analytics to Make Elephants Dance (Andy Palmer) @ Thomson Reuters

Big Data 2.0 and Analytics 3.0 create an unprecedented opportunity for large companies to become as nimble and innovative as smaller companies by democratizing analytics and decision-making. However, success requires building an information culture, with an emphasis on bottom-up information-seeking and sharing – plus giving people the power to act on the information to make faster and better decisions. Learn why transforming information culture is never easy – but necessary for those companies that want to facilitate the development of a data-driven enterprise. [RSVP]

07:45 PM Kickoff talk – Winning Data Science Competitions (Owen Zhang) @ Thomson Reuters

Owen Zhang is no stranger to data science competitions. He has competed in and won several high-profile challenges, and is currently ranked 1st out of a community of 200,000 data scientists on Kaggle. This is an opportunity to learn the tips, tricks and techniques Owen employs in building world-class predictive analytic solutions. [RSVP]

Tuesday November 4
05:30 PM Quantifying Uncertainty: Evaluating Trading Algorithms using Probabilistic Programming (Thomas Wiecki) @ hack/reduce

There exist a large number of metrics to evaluate the performance and risk of a trading strategy. Although those metrics have proven to be useful tools in practice, most of them require a large amount of data and yield unstable results on shorter timescales. Quantopian allows users to develop and launch trading algorithms that invest in the stock market. In order to identify stellar quants and connect them with investors, estimating performance with few data points becomes critical. Bayesian modeling is a flexible statistical framework well suited for this problem as uncertainty can be directly quantified in terms of the posterior distribution.

Thomas will briefly provide an overview of Bayesian statistics and how Probabilistic Programming frameworks like PyMC can be used to build and estimate complex statistical models. He will then show how several common financial risk metrics like Alpha and Beta can be expressed as a probabilistic program. Finally, he will apply this type of Bayesian data analysis to evaluate the performance of anonymized real-world trading algorithms running on Quantopian.
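
As a rough illustration of how a posterior distribution quantifies uncertainty when only a few data points are available, here is a minimal sketch. It uses a plain grid approximation rather than PyMC, and it is not Quantopian's method. The return values, the fixed volatility `sigma`, the flat prior, and the function name `posterior_mean_return` are all assumptions made for the example.

```python
import math

def posterior_mean_return(returns, sigma, grid):
    """Grid-approximate the posterior over the mean daily return.

    Assumes (for illustration only) a flat prior over `grid` and
    normally distributed returns with known volatility `sigma`.
    """
    log_post = []
    for mu in grid:
        # Log-likelihood (up to a constant) of the data under Normal(mu, sigma)
        ll = sum(-0.5 * ((r - mu) / sigma) ** 2 for r in returns)
        log_post.append(ll)
    # Normalise in a numerically stable way
    m = max(log_post)
    weights = [math.exp(lp - m) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical daily returns of a strategy (made-up numbers)
returns = [0.004, -0.002, 0.006, 0.001, -0.003, 0.005]
grid = [i / 10000 for i in range(-200, 201)]  # candidate means from -2% to +2%
post = posterior_mean_return(returns, sigma=0.01, grid=grid)

# Posterior probability that the true mean return is positive
prob_positive = sum(p for mu, p in zip(grid, post) if mu > 0)
posterior_mean = sum(mu * p for mu, p in zip(grid, post))
print(f"P(mean return > 0) = {prob_positive:.2f}")
print(f"posterior mean = {posterior_mean:.4f}")
```

With only six observations the posterior stays wide, so the probability of a positive mean return is well short of certainty. That is the kind of statement a single point estimate cannot make.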



05:30 PM Big Data for All, All for Big Data: The Cross-Industry impact of Big Data @ DataXu

Big Data has come a long way over the last few years. No longer the domain of IT, big data is infiltrating all industries and departments, with different technologies, use cases and benefits for each.

Join industry leaders in Big Data, including experts from finance and marketing and DataXu co-founders Sandro Catanzaro and Bill Simmons, as they discuss the sea change in big data technologies and how it has changed those industries.

Schedule [RSVP]
5:30 – 6:00 PM Check-in with Food and Drinks
6:00 – 6:45 PM Presentation followed by Question and Answer Period
6:45 – 7:30 PM Networking/Snacks/Drinks


06:00 PM You Have the Data, Now What? (Chris Lynch)

07:30 PM Vector Space Word Representations (Rani Nelken) @ hack/reduce

Natural Language Processing (NLP) traditionally mapped words to discrete elements without underlying structure. Recent research replaces these models with vector-based representations, efficiently learned using neural networks. The resulting embeddings not only improve performance on a variety of tasks, but also show surprising algebraic structure.

This talk will provide a general introduction to these exciting developments, providing participants with a foundation for understanding NLP. [RSVP]
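
To give a feel for the "algebraic structure" mentioned above, here is a small sketch using the well-known king/queen analogy. The three-dimensional vectors are hand-picked toy values, not learned embeddings, and the helper names `cosine` and `analogy` are hypothetical. Real embeddings are learned from large text corpora and typically have hundreds of dimensions.

```python
import math

# Toy 3-dimensional "embeddings", hand-picked for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # prints "queen" for these toy vectors
```

The point of the sketch is only that vector arithmetic plus a nearest-neighbour lookup can answer an analogy question once words live in a shared vector space.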

Wednesday November 5
05:30 PM Data Science on a Budget: Maximizing Insight and Impact (Nicholas Arcolano) @ hack/reduce

Many companies have “big data”, but not every company has the resources (or need) for a big data team. In this talk we will discuss lessons we’ve learned from working as part of a small team within a fast-moving mobile start-up and techniques for getting the most out of your data on a budget. [RSVP]

05:30 PM In Defense of Imprecision: Why Traditional Approaches to Data Visualization are Changing (Mark Schindler) @ Cambridge Innovation Center

In the worlds of research, science, and academia, much attention is given to precision, objectivity, and de-biasing data. The ability to “lie with data” is a legitimate concern. In business analytics, and the burgeoning area of consumer-facing data visualization, though, complete objectivity is not the singular goal.

This talk focuses on situations in business and personal decision-making processes where data is filled with subjectivity, creativity and intuition. [RSVP]

05:30 PM Building Fast Applications for Streaming Data (Ryan Betts) @ Microsoft NERD Center

Data is moving at blinding speeds, generated by a wide range of sources – from mobile phones and sensors to a variety of connected “smart” devices. This data, most valuable the moment it arrives, will continue to increase in both volume and variety. Leveraging data instantly provides the opportunity to make real-time decisions, reduce risks, and sense patterns, delivering the competitive edge to react quickly and correctly.

Stream processing was not designed to serve the needs of modern Fast Data applications. Despite its ability to rapidly ingest data, streaming requires additional code – and a database – to maintain state. This adds application complexity, moving performance bottlenecks to another component in the system. The results are systems that don’t meet the requirements of modern applications.

Employing a solution that handles streaming data, provides state, ensures durability, and supports transactions and real-time decisions is key to benefitting from fast data. During this presentation participants will learn: (a) the difference between real-time analytics and real-time decisions; (b) how streaming applications deliver more value when built on an in-memory, NewSQL database; and (c) that making fast data smart is a significant market opportunity that requires a new database platform designed for the volume, variety and scale of high-speed data. [RSVP]

07:30 PM Big But Personal Data: How Human Behavior Bounds Privacy and What We Can Do About It (Yves-Alexandre de Montjoye) @ Microsoft NERD

We’re living in an age of big data, a time when most of our movements and actions are collected and stored in real time. This data offers unprecedented insights on how we behave as a species.

In this talk, the speaker focuses on anonymous location data to show how few points – approximate places and times – are enough to identify individuals, even when no “private” information, such as names, e-mails or phone numbers, was collected. The talk will focus further on the case of mobile phone data, showing that phone data is not “just” metadata and that a lot can actually be inferred about an individual by looking at the way he/she uses his/her phone. Finally, the speaker will discuss the impact of metadata on society and some of the legal and technical solutions that are currently being developed at the MIT Media Lab. [RSVP]

07:30 PM How to Quantify Culture: Introduction to R Workshop (Ethan Fosse) @ hack/reduce

This workshop provides an overview of R for those who are unfamiliar with statistics or programming. The first part of the workshop will review the basics of R as an object-oriented statistical programming language, with an emphasis on creating and manipulating objects. The second part will focus on the fundamentals of data analysis, including how to load and manipulate data sets, summarize and visualize variables (using bar graphs, scatter plots, and histograms), and understand relationships among variables by fitting statistical models (such as linear regression, analysis of variance, and classification techniques). No knowledge of statistics or programming is required. [RSVP]

07:30 PM Visualizing Networks (Lynn Cherny) @ Cambridge Innovation Center

Network data are increasingly pervasive, but can be hard to work with. The naive first pass at a network diagram usually looks like a “hairball.” However, adding simple network measures such as degree, betweenness centrality, and community membership enables the creation of a more comprehensive network representation.

This talk focuses on improving the end-user experience of network visuals, covering the ubiquitous force layout, alternative layout options, and interaction techniques. The examples shown will be primarily from D3.js and Gephi. [RSVP]

Thursday November 6
05:00 PM Data Science to the Rescue of Healthcare Costs (Ramesh Kumar) @ Cambridge Innovation Center

More and more data is being collected and made available in our health care system. A McKinsey report projects that $300bn can be identified and saved in the US healthcare system through better data analytics. But how do we do it? Where is the money in our system? What kind of data is available? What type of analytics is going to disrupt our health care system? What are the opportunities?

Come listen to four leading innovative companies and entrepreneurs share their approaches to using healthcare data to reduce our healthcare costs. [RSVP]

05:30 PM The Shape of Data: An Intuitive Introduction to the Geometry Behind Machine Learning and Data Mining (Jesse Johnson) @ Thomson Reuters

Most experts in data analysis think about data in terms of (relatively simple) geometry. However, in many introductory sources, the geometry is hidden under layers of technical details. The goal of this talk is to put the geometry front-and-center, giving the audience a perspective that will help them to continue exploring data. [RSVP]

05:30 PM Bringing Coherence to Chaos – Automated Analysis on Large-Scale Social Data for the 2014 World Cup (Catherine Havasi) @ hack/reduce

Twitter has changed. The growth and worldwide success of Twitter have surfaced natural limitations of hashtags and keyword searches. What was once a mechanism for organizing information for efficient consumption is now often rendered obsolete by the overwhelming volume and diversity of discussion. Listening to a large number of posts on a given hashtag becomes unproductive as the conversations spawned within and around these hashtags become indistinct and are drowned out.

For SONY, Luminoso restored the ability to understand, consume, and participate in large-scale social media discussions by automatically eliminating spam, removing duplicates, clustering thematically similar conversation, and surfacing meaningful discussion. SONY launched the world’s first dedicated football social network – One Stadium Live – a network that enabled fans and media to experience the 2014 FIFA World Cup like never before, harnessing Luminoso’s technology. The One Stadium Live platform provided users with automatically curated topics of discussion, allowing them to participate in and follow conversations they found interesting, while avoiding the ones they didn’t – across 6 different languages. [RSVP]

07:00 PM Neural Networks, Deep Learning, and Financial Modeling (Eric Morris) @ Cambridge Innovation Center

The talk will focus on the history, structure, and general applicability of artificial neural networks and deep learning with attention to use in financial modeling problems.

The recent resurgence of artificial neural networks for machine learning problems in both academia and industry has been led by research and development from Geoffrey Hinton at the University of Toronto, Yann LeCun at NYU, Yoshua Bengio at the University of Montreal, and Andrew Ng at Stanford. This powerful and versatile machine learning tool can be applied to a wide range of tasks, especially big data problems where making a priori assumptions is difficult (or impossible) and significant nonlinearities exist. Using artificial neural networks for machine learning tasks has improved state-of-the-art performance on benchmark datasets. How will you apply artificial neural networks to create value in your business area and define a new peak in performance? [RSVP]

07:30 PM Multi-Armed Bandits and Reinforcement Learning in Computational Advertising (Michael Els) @ hack/reduce

This talk will cover the most common learning strategies to solve the multi-armed bandit problem. It will involve a Python simulation environment to illustrate how the system changes under different assumptions and how prior learning can influence and seed the system. The talk will also discuss the computational advertising framework at MaxPoint, where these types of algorithmic learning strategies are employed to enable optimal ad serving behavior. [RSVP]
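
For readers new to the multi-armed bandit problem, here is a minimal simulation sketch of one common strategy, epsilon-greedy. It is not the speaker's simulation environment or MaxPoint's system. The click-through rates, the parameter values, and the function name `epsilon_greedy` are assumptions chosen for illustration.

```python
import random

def epsilon_greedy(true_rates, steps, epsilon, seed=0):
    """Simulate an epsilon-greedy agent on Bernoulli arms.

    `true_rates` are hypothetical click-through rates, one per ad.
    Returns the pull count and the estimated rate for each arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best estimate so far
            arm = max(range(n_arms), key=lambda i: estimates[i])
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the running mean reward for this arm
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return pulls, estimates

pulls, estimates = epsilon_greedy([0.02, 0.05, 0.20], steps=20000, epsilon=0.1)
print("pulls per arm:", pulls)
print("estimated rates:", [round(e, 3) for e in estimates])
```

Raising `epsilon` spends more traffic on exploration, while lowering it commits sooner to the arm that currently looks best. Starting `estimates` from non-zero values would be one simple way to seed the system with prior learning.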

07:30 PM Mining Big Data to Solve a Specific Customer Need (Mans Olof-Ors) @ Thomson Reuters

This talk describes how to use a hackathon as a catalyst to bring together content, technologies and expertise from across a company and come up with an innovative solution for establishing entity similarities (comparable companies). [RSVP]

Friday November 7
05:00 PM Doing Data Science with Python and Scikit-learn: The Mystery of the Sicilian Olive Oils (Rahul Dave) @ CIC

The talk will cover an example of the entire data science process, from cleaning the data to exploring it, and from visualizing the data to analyzing it. It will then delve into the machine-learning process to classify the data, with multiple techniques, and gain an understanding of the tradeoffs involved in making predictions. Finally, participants will learn methods to ensure the predictions are robust. [RSVP]

05:30 PM Thinking in Data Workshop (David Weisman) @ hack/reduce

This hour-long beginner-level workshop (no laptop needed) focuses on critical thinking about data. The speaker will focus on examples of sampling problems, biases, outliers, confounding, and spurious correlations, and show how these lead to wrong conclusions. The workshop will also show how exploratory data analysis through visualization can bring clarity. Participants will take away a greater awareness of data itself and be able to apply these ideas to their data science projects. [RSVP]

07:30 PM Mining Living Organisms: Inferring Biological Models from Wet-Lab Experiments (Daniel Lobo) @ hack/reduce

Many living organisms have an extraordinary capacity to self-generate and self-repair complex patterns and shapes. To elucidate the mechanisms driving these poorly-understood processes, biologists are producing an extraordinarily complex dataset of surgical and genetic experiments. In this talk, the authors present their approach based on formal ontologies, evolutionary computation, and in silico simulators to automate the discovery of biological models from wet-lab experiments, which they hope will pave the way to revolutionary biomedical applications. [RSVP]

Saturday November 8
08:30 AM Workshop – An introduction to Bayesian Statistics using Python (Allen Downey) @ CIC

This is an introduction to Bayesian statistics using Python. Bayesian statistical methods are usually presented mathematically, but many of the ideas are easier to understand computationally. People who know some Python have a head start. Note: We will use material from Think Bayes, published by O’Reilly Media and available free at greenteapress.com.

Bayesian statistical methods are becoming more common and more important, but there are not many resources to help beginners get started. People who know Python can use their programming skills to get a head start. I will present simple programs that demonstrate the concepts of Bayesian statistics, and apply them to a range of example problems. Participants will work hands-on with example code and practice on example problems.

Participants should have at least basic Python and basic statistics. If you learned about Bayes’s theorem and probability distributions at some time, that’s enough, even if you don’t remember it! Participants should bring a laptop with Python and matplotlib installed. You can work in any environment; you just need to be able to download a Python program and run it. I will provide code ahead of time to help with setup. [RSVP]
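
As a taste of the computational approach, here is a minimal sketch of a Bayesian update on a classic two-bowl cookie problem. It is not necessarily the workshop's own code. The bowl contents and the function name `bayes_update` are illustrative assumptions.

```python
def bayes_update(prior, likelihood):
    """Multiply each hypothesis' prior by its likelihood, then normalise."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items()}

# Two bowls of cookies, equally likely to be chosen a priori.
prior = {"bowl_1": 0.5, "bowl_2": 0.5}

# Probability of drawing a vanilla cookie from each bowl:
# bowl 1 holds 30 vanilla and 10 chocolate, bowl 2 holds 20 of each.
likelihood_vanilla = {"bowl_1": 30 / 40, "bowl_2": 20 / 40}

# We drew a vanilla cookie; how likely is it that it came from each bowl?
posterior = bayes_update(prior, likelihood_vanilla)
print(posterior)
```

For these numbers the posterior probability of bowl 1 works out to 0.6. Calling `bayes_update` again with the posterior as the new prior would model drawing a second cookie, under the simplifying assumption that cookies are replaced after each draw.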

Speaker Lineup

Owen Zhang
Owen Zhang is the Chief Product Officer at DataRobot, and is ranked #1 on the Kaggle ...

Andy Palmer
Andy Palmer has helped start, fund, found or advise 50+ innovative companies in ...

Thomas Wiecki, PhD
Thomas Wiecki received his PhD from Brown University where he developed Bayesian models to ...

Allen Downey
Allen Downey is a Professor of Computer Science at Olin College of Engineering. He has ...

Daniel Lobo, PhD
Daniel Lobo is a Postdoctoral Associate at the Biology Department at Tufts University. His ...

Catherine Havasi, PhD
Catherine Havasi has been researching language and learning for nearly fifteen years. She ...

Jesse Johnson
Before starting at Google, Jesse was a math professor studying abstract geometry and ...

Ramesh Kumar
Ramesh Kumar is co-founder and CEO of Zakipoint Health, a startup focused on changing ...

Yves-Alexandre de Montjoye
Yves-Alexandre de Montjoye is a graduate student at the Massachusetts Institute of ...

Ethan A. Fosse
Ethan A. Fosse is a Ph.D. Candidate in Sociology at Harvard University and Teaching Fellow ...

Nicholas Arcolano, PhD
Nicholas Arcolano, PhD, Senior Data Scientist, FitnessKeeper, Inc. Nicholas is a data ...

Mans Olof-Ors
Mans Olof-Ors is Vice President, Product Innovation at Thomson Reuters Catalyst Lab in ...

Eric Morris
Eric Morris is consulting out of the Cambridge Innovation Center and working on several ...

Chris Lynch
Chris Lynch joined Atlas Venture in 2012 as a member of the Technology investment team. ...

Michael Els
Michael Els is Principal Data Scientist at MaxPoint, where he researches, develops, and ...

Ryan Betts
Ryan Betts is CTO at VoltDB. He was one of the initial developers of VoltDB’s commercial ...

Rahul Dave, PhD
LxPrior Inc. and Harvard University. Rahul is a Data Scientist. He is a partner at LxPrior, ...

Sheamus McGovern
Sheamus McGovern is a leading technologist with many years experience building complex ...

Bill Simmons, PhD

Data All-Star Panelist

Dr. Willard (Bill) Simmons, DataXu’s CTO and co-founder, is the brain behind DataXu’s ...

David Weisman, PhD
David Weisman, Ph.D. is an expert in data science, with over 30 years of accomplishment in ...

Sandro Catanzaro
Sandro Catanzaro, Co-Founder, Senior Vice President of Analytics and Innovation, is the ...

Mark Schindler
Mark Schindler is co-founder and Managing Director of GroupVisual.io, a Cambridge, MA ...

Lynn Cherny, PhD
Lynn Cherny is a local data analysis and visualization consultant who works in Python, R, ...

Rani Nelken, PhD
Rani Nelken is Director of Research at Outbrain, where he leads a research team focusing ...