The quest for insights,

the true objective of big data


John Alexis Guerra Gómez
@duto_guerra


http://infovis.co/bigDataQuest

Outline

  1. Who am I?
  2. What is Big Data?
  3. How to make sense of it?
  4. Insights!!!

Who am I?

PhD

Silicon Valley

What is Big Data?

You might have heard of the Vs of Big Data

  • Volume
  • Velocity
  • Variety
  • and Veracity and Value
  • Too ambiguous!! Let's go beyond that

How Big is big?

Can you fit it in one computer?

Yes? -> Then it is not really big

Let's call it big data only if it doesn't fit on one computer (and has the 3Vs)

Why this criterion?

Because if it fits on one computer, you don't need the overhead of big data technologies; just use a traditional relational database.

Example: photo collection

  • One photo -> 10MB
  • 1k photos on a cellphone -> 10MB * 1k = 10,000MB = 10GB
  • 50k photos on your computer -> 10MB * 50k = 500GB
  • Is that big data?
  • No, you can fit that on one cheap external hard drive

Problem: how many blue photos are in my collection?

How do you compute this?

Put all your photos on one computer

Go through the whole collection and count
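
A minimal sketch of that single-computer scan, assuming each photo has already been reduced to its average (red, green, blue) color. The `is_blue` rule and the sample values are hypothetical; the slides do not define what makes a photo "blue".

```python
# Hypothetical representation: each photo is summarized by its average
# (red, green, blue) color, with every channel in the range 0-255.
def is_blue(avg_rgb):
    """Assumed rule: a photo is 'blue' if blue is its strongest channel."""
    r, g, b = avg_rgb
    return b > r and b > g


def count_blue(photos):
    """Go through the whole collection once and count the blue photos."""
    return sum(1 for photo in photos if is_blue(photo))


collection = [(20, 40, 200), (210, 180, 40), (10, 90, 220), (120, 120, 120)]
print(count_blue(collection))  # 2
```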

Flickr size

80+ trillion photos (80,000,000,000,000)

That's big data

How many blue photos on Flickr?

How do you compute this?

Distribute the data among hundreds of thousands of computers (a cluster).

Compute subtotals on each chunk of the data. (Map)

Aggregate the subtotals into one big total. (Reduce)
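
The three steps above can be sketched in plain Python. This is only an illustration that runs on one machine: the chunks stand in for computers in the cluster, and `is_blue`, `split_into_chunks`, `map_count_blue` and `reduce_totals` are hypothetical names, not the API of Hadoop or any other framework.

```python
from functools import reduce
from operator import add


def is_blue(avg_rgb):
    """Assumed rule: a photo is 'blue' if blue is its strongest channel."""
    r, g, b = avg_rgb
    return b > r and b > g


def split_into_chunks(photos, num_chunks):
    """Distribute: deal the photos out round-robin, one chunk per computer."""
    return [photos[i::num_chunks] for i in range(num_chunks)]


def map_count_blue(chunk):
    """Map: each computer counts the blue photos in its own chunk."""
    return sum(1 for photo in chunk if is_blue(photo))


def reduce_totals(subtotals):
    """Reduce: add the per-chunk subtotals into one big total."""
    return reduce(add, subtotals, 0)


photos = [(20, 40, 200), (210, 180, 40), (10, 90, 220),
          (120, 120, 120), (0, 0, 255)]
chunks = split_into_chunks(photos, num_chunks=3)
# On a real cluster the map calls would run in parallel, one per computer.
subtotals = [map_count_blue(chunk) for chunk in chunks]
print(subtotals, reduce_totals(subtotals))  # [1, 1, 1] 3
```

Because addition does not depend on the order of the subtotals, the chunks can be processed in any order and on any machine.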

How many computers do you need?

Total data size / capacity of one computer?

What if one computer breaks down?

We need redundancy -> Each photo is stored on several computers
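
The capacity estimate and the redundancy requirement can be combined into one back-of-the-envelope calculation. This is a rough sketch: the function name `computers_needed`, the 1 TB capacity and the replication factor of 3 are assumptions for illustration, and decimal units are used to match the photo arithmetic earlier in the slides.

```python
MB = 10**6   # decimal units, matching 10MB * 1k = 10GB
TB = 10**12


def computers_needed(num_photos, photo_bytes, capacity_bytes, replicas=1):
    """Total size (times the number of copies) divided by one computer's
    capacity, rounded up to a whole computer."""
    total_bytes = num_photos * photo_bytes * replicas
    return -(-total_bytes // capacity_bytes)  # ceiling division on integers


# The 50k-photo personal collection on a (hypothetical) 1 TB drive.
print(computers_needed(50_000, 10 * MB, 1 * TB))              # 1
# The same collection if every photo is stored three times.
print(computers_needed(50_000, 10 * MB, 1 * TB, replicas=3))  # 2
```

Estimating the Flickr case the same way requires a per-computer capacity and a replication factor, neither of which is given in the slides.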

How do we control versions? How to keep records? What goes where?

That's why we need big data technologies!!

Technologies

  • MapReduce (Hadoop, Hive, Pig, Spark ...)
  • NoSQL Databases (Redis, Cassandra, MongoDB, Neo4j)
  • Distributed Relational (SQL) Databases (MySQL, PostgreSQL, Oracle, SQL Server)
  • Many others

Hadoop

  • Computing platform for big data
  • Uses clusters for storing and processing the data

Hadoop Architecture

Spark

A distributed computing alternative to MapReduce.

  • Easier to use
  • Integrates better with traditional programming models

NoSQL Databases

  • Scalable storage platforms that use different techniques from traditional SQL databases
  • They sacrifice features for performance

Types of NoSQL

  • Column Oriented: Cassandra, HBase, Redshift ...
  • Key-value: Redis, memcached, Aerospike ...
  • Document based: MongoDB, CouchDB, DynamoDB ...
  • Graph based: Neo4j, Titan ...

Bonus

Introduction to NoSQL for Web Developers

Distributed Relational DB

  • You can also use traditional databases in a distributed way.
  • The database is divided into shards.
  • Usually doesn't scale that well.
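
One common way to decide which shard holds a record is to hash its key and take the result modulo the number of shards. The sketch below assumes that scheme; the slides do not say which sharding strategy is meant, and `shard_for` and the photo IDs are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a record key to a shard index in the range [0, num_shards)."""
    # A cryptographic hash gives the same result on every machine and run,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Spread some hypothetical photo IDs over 4 shards (4 database servers).
num_shards = 4
shards = {i: [] for i in range(num_shards)}
for photo_id in ["photo-1", "photo-2", "photo-3", "photo-4", "photo-5"]:
    shards[shard_for(photo_id, num_shards)].append(photo_id)
print(shards)
```

A known drawback of plain modulo sharding is that changing the number of shards reassigns most keys, so data has to be moved between servers.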

Others

  • Google Dataflow
  • Google's replacement for MapReduce, based on flows.
  • Supposed to scale better.
  • AFAIK, it can only be used on Google's cloud.

Making sense

How to make sense of it?

  • Statistical Analysis
  • Machine Learning and Artificial Intelligence
  • Visual Analytics (and data analytics)

Data Mining/Machine Learning

Information Visualization

Infovis + Algorithms

Traditional

  • Query for known patterns
  • Display results using traditional techniques

Pros:
  • Many solutions
  • Easier to implement

Cons:
  • Can’t search for the unexpected

Data Mining/ML

  • Based on statistics
  • Black box approach
  • Output outliers and correlations
  • Human out of the loop

Pros:
  • Scalable

Cons:
  • Analysts have to make sense of the results
  • Makes assumptions on the data

InfoVis

  • Visual Interactive Interfaces
  • Human in the loop

Pros:
  • Visual bandwidth is enormous
  • Experts decide what to search for
  • Identify unknown patterns and errors in the data

Cons:
  • Scalability can be an issue

Why should we visualize?

Anscombe's quartet

Anscombe's visualized
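
Anscombe's quartet is four small datasets whose summary statistics are nearly identical even though their scatter plots look completely different. The sketch below computes those statistics with the standard library only; the helper names `pearson` and `summarize` are introduced here for illustration.

```python
from statistics import mean, variance

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def summarize(xs, ys):
    """Mean and sample variance of x and y, plus their correlation."""
    return {
        "mean_x": round(mean(xs), 2),
        "mean_y": round(mean(ys), 2),
        "var_x": round(variance(xs), 2),
        "var_y": round(variance(ys), 2),
        "corr": round(pearson(xs, ys), 2),
    }


for name, (xs, ys) in quartet.items():
    print(name, summarize(xs, ys))
```

All four datasets come out at a mean of 9 for x, about 7.5 for y, a variance of 11 for x, about 4.1 for y, and a correlation of about 0.82. The numbers alone cannot tell them apart; the plots can.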

In Infovis we look for Insights

  • Deep understanding
  • Meaningful
  • Non-obvious
  • Actionable

How do I do it?

What do I use?

Insights

FDA

Task: Detect changes in reports of drugs' adverse effects

User: FDA Analysts

State of the art

https://treeversity.cattlab.umd.edu/

Health insurance claims

Task: Detect fraud networks

User: Undisclosed Analysts

Clustering

Force in a box

Overview

Ego distance

Tweetometro

Task: Twitter behavior during Presidential Elections

User: Me

http://tweetometro.co

Normal tweets

Weird tweets?

Creation dates

Number of followers

What car to buy?

Task: What's the best car to buy?

User: Me

Normal procedure

Ask friends and family

Problem

That's inferring statistics from a sample of n=1

Better approach

Data-based decisions

http://tucarro.com

Take home message

  • Big data? Sure, if it doesn't fit on one computer
  • Finding insights, that's what matters
  • Visual Analytics, a good way of finding insights

Thank You

Questions?

John Alexis Guerra Gómez

johnguerra.co
@duto_guerra

Bonus

Types of Visualization

  • Infographics
  • Scientific Visualization (sciviz)
  • Information Visualization (infovis, datavis)

Infographics

Scientific Visualization

  • Inherently spatial
  • 2D and 3D

Information Visualization

Infovis Basics

Visualization Mantra

  • Overview first
  • Zoom and Filter
  • Details on Demand

Perception Preference

Adapted from: Tamara Munzner, book chapter

Data Types

  • 1-D Linear: Document Lens, SeeSoft, Info Mural
  • 2-D Map: GIS, ArcView, PageMaker, Medical imagery
  • 3-D World: CAD, Medical, Molecules, Architecture
  • Multi-Var: Spotfire, Tableau, GGobi, TableLens, ParCoords
  • Temporal: LifeLines, TimeSearcher, Palantir, DataMontage, LifeFlow
  • Tree: Cone/Cam/Hyperbolic, SpaceTree, Treemap, Treeversity
  • Network: Gephi, NodeXL, Sigmajs