It’s the End of BI as We Know It

March 30, 2017, 10:14 pm

≫ Next: Announcing the Looker Block for Audience Analytics by Permutive

≪ Previous: How Looker Makes Growing a Data Platform an Organic Experience

Finally Google App Data at Your Fingertips

I’ve often referred to the Looker Platform as the ‘3rd Wave’ of Business Intelligence. But as I talk to more and more companies that are winning with data, I realize that Looker is more than just another wave of BI. It actually represents the end of what we’ve come to know as BI altogether.

Some history

BI started with monolithic stacks, reliable but inflexible. It was the era of the data rich and the data poor, where those without access to data starved as they waited (and waited and waited) for someone to finally have time to help them. That frustration led to the second wave: a revolution toward a hodge-podge of disjointed self-service tools where users grabbed whatever data they could lay their hands on and threw it into data cleansing, blending, and visualization tools for analysis.

But that revolution came at a cost. For all the advantages of do-it-yourself analysis, the tools created a mess of silos that didn’t talk to each other, and you lost the ability to speak in a common language with anyone else at your company. You could no longer trust that your interpretation of the data was the same as your colleagues’. Enter data chaos.

But Looker represents something new: that 3rd wave of BI. Where users can access whatever data they need, from one central source of truth that provides a shared language for the entire company. One platform, one tool…no more waiting, no more hunger, and no more chaos.

What happens when the hunger for data is sated??

This is where the magic comes in. When people are no longer starving for data, new potential opens up in what they can imagine doing. The reactive reporting on historical data, owned by a single group within the company becomes obsolete. Data becomes how you do your job, not just how you measure it with dashboards. BI, as we know it, dies.

Once the technical and knowledge barriers break down, the responsibility of looking at the data moves beyond the BI team. Everyone can start asking and answering their own questions and integrating data into their everyday workflows.

Data is no longer just a place to find answers. It’s the place where ideas originate. With every employee looking at the same numbers, sharing the same truth, they’re able to collectively make smarter, more informed business decisions.

This isn’t just a theory. It’s already happening.

Counsyl, for example, runs their billing operations directly from Looker. Each member of their billing team logs in every day to see their personalized dashboard showing where their cases are in the payment pipeline. This isn’t what you’d normally think of a “data-driven job” but at Counsyl it is. And because billers have a deep familiarity with their processes and access to the data, they’ve implemented new optimizations to their billing funnel, bringing more revenue into company.

HubSpot’s sales team can see their progress towards goal right in Salesforce where they work every day. This isn’t a cute visualization and it’s not a vanity metric - it’s the data they need to know to do their jobs and stay motivated, and it’s right there where they need it to be.

BI as a destination is dying

Cobbling together a bunch of siloed, single-purpose tools that don’t communicate with each other will not feed the masses. We know what happens when you ask users to log into four different tools to get four different types of information. They don’t log into any of them.

But when the data lives in one place, users can embrace their curiosity. Data is where the question starts, gets answered, sparks another question, and grows.

That’s the only way data can really work. It can’t be a place that questions go to wait in line until they die. Dashboards and reporting are no longer enough. Organizations that want to move beyond reporting so they can use data to drive questions, answers and everything in between, are turning to a data platform.

For this future to exist, every employee, in every department, needs access to a comprehensive view of the data. Data has to be part of the very fabric of your company, woven into every moment of every day. Then, the assumption in every conversation is that every argument can be backed up with data.

Because fundamentally, the thing that’s changing isn’t the way companies do BI. It’s the way they do business. They’re throwing out the old way of guessing at a solution, then checking next month’s report to see how things turned out. They’re moving to true, iterative, agile data cultures.

The transition can be a little scary at first, but ask the companies that are winning how it’s working out for them. They’ll tell you they feel fine.

Learn more about the Looker data platform in a product demo or tune into a webinar I am hosting on April 13, Winning with the 3rd Wave of BI, to hear more about the future of BI.

↧

Announcing the Looker Block for Audience Analytics by Permutive

April 3, 2017, 9:00 pm

≫ Next: Yes! We Have No Connectors

≪ Previous: It’s the End of BI as We Know It

Finally Google App Data at Your Fingertips

Audience Analytics by Permutive Block - Instantly Compare Your Audiences and Analyse Behaviours

Permutive is the world’s first realtime Data Management Platform, trusted by some of the world’s largest publishers to target their audiences on desktop, mobile and FIA. Permutive enables you to collect, enrich and segment audience data across devices for real-time analysis and targeting. The Audience Analytics by Permutive Looker Block lets you instantly analyse and compare these audience segments, powered by up-to-date real-time data.

We are excited to announce our new Audience Analytics Looker Block, which lets publishers instantly compare, see the overlap and total size of their audience segments. It’s an incredibly flexible Block which can be used for all your audiences to compare any data - how you use it is up to you!

Instant, Customisable, Up-to-Date Audience Development

All publishers need to understand their audience segments. To optimise ROI, build loyalty, for a post-campaign report, there are endless advantages to understanding your audience.

Permutive’s Audience Analytics Block is the ideal tool for these insights. It’s a set of tools for you to quickly investigate the audiences you’ve created in Permutive.

By default, this Block contains 3 dashboards.

The Audience Comparison dashboard is a fantastic tool for quickly comparing any audience’s side-by-side. It includes engagement metrics and can be customised to include metrics on any data you have in Permutive. Use this dashboard to compare audience segments’ pageviews per session, video completion, gender, page yield… any data you’re collecting in Permutive can be compared, by audience, using this dashboard!

Session Length

Pageviews

The Audience Overlap dashboard instantly reports back on the overlap between all your audience segments. For example, for post-campaign sales reporting, how do the audience that viewed the campaign overlap with engaged users or interest categories? The more creatively and purposefully you build your audiences - the more interesting and valuable your insights.

Segments Overlap

The Audience Growth dashboard is a high-level summary of your audience segments’ total size. Use this tool to understand the total uniques who have fallen into each audience.

Time Series

Cumulative Segment Growth

In summary, this block gives you quick, up-to-date access to audience insights. You can customise the block by adding new metrics and comparing new audiences you’ve created.

If you’re interested in using Permutive we’d love to chat to you, please get in touch by emailing team@permutive.com, or visit our website at www.permutive.com

↧

Yes! We Have No Connectors

April 6, 2017, 11:08 pm

≫ Next: Data Science with Looker: Part I

≪ Previous: Announcing the Looker Block for Audience Analytics by Permutive

Yes We have no connectors

Over the last decade, we’ve seen an explosion of SaaS applications. Salesforce, HubSpot, MailChimp, Box, NetSuite, Zendesk, AdWords, New Relic, Dropbox, Facebook Ads and so many more. (Those were literally just the first 10 that came to mind.)

And all of those apps produce data. Data that’s critical to running your business.

While many of those apps include some basic reporting, you need to combine data from all your apps in one place to derive real business value from it. That’s the only way to get a global view of what’s happening and why it happened.

Naturally, lots of customers come looking for an analytics tool that can plug into all their SaaS apps, suck the data out, centralize it, magically make sense of it, and let them find insights.

I’m going to be straight with you: Looker is not that tool.

Other tools won’t say that. They’ll say, “Yes! That’s us!”

It isn’t.

I know, because that tool doesn’t exist. And I’d strongly advise caution before buying what they’re selling.

Opportunity Costs

The first reason you should be wary of anyone promising that magical solution is that keeping up to date with dozens of other vendors’ API specs and customers’ unique needs is hard.

When you buy connectors that are bundled with an analytics tool, you can bet that vendor is spending thousands of engineer-hours building and maintaining their connectors. And those are hours they’re not spending making their core analytics product better.

Buying connectors from your analytics vendor is kind of like buying a TV/DVD combo. You get a mediocre TV and a mediocre DVD player. But the real issue arises when you’re ready to upgrade to a better TV or to replace your DVD player with Blu-Ray. The purchase that started as a time and money saver is now far more complicated and expensive.

Two Flawed Approaches

The second reason to be cautious is that built-in connectors work one of two ways: either they give you a black box or they give you a data dump.

Black boxes are initially appealing, because they look simple from the outside. But that’s only because they’ve hidden the gnarly stuff inside, where you can’t see or understand it. Analytics you don’t understand are analytics that can lead to mistaken decisions.

You don’t know what’s happening in those black boxes, so it’s hard (or impossible) to understand which definition of retention rate or open rate or any of the hundreds of other business-critical definitions are being used.

Black Boxness of Connectors

Because you can’t see inside the black box, you also can’t customize it to your business’s particular needs. So costs skyrocket as you’re forced to pay extra for every bit of customization.

The other path tools take is to offer a data dump instead of a black box. However, without guidance as to what the data actually means, a data dump doesn’t provide much value either. Without a model for what the data means, you have to have create that logic from your own knowledge. And while that might seem reasonable, wait until you see what these data dumps look like.

For example, here’s a “basic” entity-relationship diagram for Salesforce’s data tables (and this is just the sales objects):

Salesforce Data Tables

It’s not simple. It’s not easy to understand. And building out those relationships in your analytic tool is difficult, time-consuming, and error-prone.

Our Approach

At Looker, we chose to focus on building a phenomenal data exploration platform. And spending time on connectors would impede our progress toward that goal. But that doesn’t mean we leave you to fend for yourself when it comes to centralizing your data.

Instead of building connectors ourselves, we’ve cultivated an extraordinary group of partners whose sole focus is moving data seamlessly out of your SaaS apps into whichever data warehouse you choose. Partners like Fivetran, Stitch, and Segment do an amazing job centralizing your raw data from apps like Salesforce, Zendesk, and dozens more. Additionally, by getting your data out of these SaaS apps and into a centralized environment, you have created a store of all the data that drives your business that you truly own.

And then Looker takes over with our Looker Blocks, plug-and-play code modules that make sense of the data for you, so you can start exploring it immediately. Looker Blocks are just LookML (Looker’s flexible SQL modeling language), so you can see exactly how the data is structured and customize it however you need.

It’s not a black box and it’s not a data dump. It’s convenient and transparent, and it’s all done without distracting Looker’s engineers from building amazing features for our customers.

Looking Ahead

What’s more, this is the only approach that is future-proof. We’re already seeing the ecosystem evolve with tools like Google’s new BigQuery Data Transfer Service, where Google pushes YouTube, DoubleClick, AdWord, and Google Analytics data directly into your BigQuery instance. As the cost of keeping up with an ever-growing set of APIs grows, we expect more SaaS vendors will offer the option to have your data delivered to wherever you need it.

Moving data yourself is expensive and slow. Subscribing to a data stream or having the vendor deliver it to cloud storage is getting easier everyday. As more vendors move to this new model, do you really want to be stuck with an analytic tool that you chose for an obsolete feature? Or do you want something that will work with whatever your future stack looks like?

Choose the pipeline that makes sense for you now (we’re happy to help you choose); pipe the data into the database that fits your current needs (we’re happy to help there, too); and put the best data exploration platform on top (happy to help you choose, but I think you know what the answer is going to be 😀)

(If you’ve made it to the end, you deserve a song: Yes! We Have No Bananas)

↧

Data Science with Looker: Part I

April 10, 2017, 9:00 pm

≫ Next: Data Science with Looker: Part II

≪ Previous: Yes! We Have No Connectors

Yes We have no connectors

Data Science with Looker and Python

Looker isn’t just a BI tool. It’s a full data platform, which means you can use it as part of a data science workflow. Looker is great for cleaning data, defining custom metrics and calculations, exploring and visualizing data, and sharing findings with others across your organization, including scheduled data deliveries. For many types of analysis, you can efficiently complete your entire workflow in Looker.

But for more advanced analytics there are specialized tools that are going to work better for the remaining 20% of your workflow. Looker writes SQL, and while SQL is great for a lot of things, you’re not going to use it to run a clustering algorithm or build a neural network. Because of these limitations, data scientists usually turn to Python or R and run them in a dedicated Data Science Environment (DSE) like R Studio or Jupyter Notebooks.

A typical workflow may look like this:

Advanced Analytics Diagram 1

Before a data scientist can start analyzing data in a DSE, they first extract the relevant data from their company’s warehouse. They then spend a considerable amount of time preparing the data, including merging, reshaping, and cleaning this data in their DSE. Once the data has been prepped, advanced functions — whether they are predictive models or optimizations — can finally be written in Python or R. But most advanced analytics workflows are iterative. Models are constantly updated to include additional variables. This means repeating steps 1 and 2 (extraction and data prep) multiple times in order to arrive at an optimal dataset.

Once you finally have something you’re ready to share, you’re faced with the question of how to get the findings out of your Data Science Environment and out to your stakeholders and decision-makers for review. Because visualizing data and then sharing it isn’t the core function of DSEs, is often a labor-intensive process.

So, while a data scientist spends most of their time in the DSE, only a small portion of their time is actually spent doing the thing that the DSE is best at: advanced analytics. Most of their time is spent preparing data and maintaining ways to share that data with the rest of the organization. But Looker is good at exactly the pieces of the workflow that DSEs are bad at. So we’ve seen data scientists adopt a right-tool-for-the-job workflow that lets them get more done, more efficiently. Here’s what that usually looks like:

Advanced Analytics Diagram 2

What’s different about this approach?

Looker lets you accomplish the first steps of extraction, data prep and data exploration far more quickly than in a pure DSE workflow. Then, when you’re ready to tackle the data science portion of this workflow, you use the DSE for what it’s great at: writing advanced predictive models and tuning them.

Looker’s modeling layer allows you to define how tables relate to each other and specify custom metrics in an intuitive, reusable way that lets you iterate, clean, and explore more quickly. Looker automatically writes the right SQL to get access to the data you need, in the shape you need it. And because Looker is a data platform, it’s easy to simply pull in the data you’ve curated (or the pre-written SQL that’s producing it) with a single API call.

Once you’ve finalized your model and are ready to deploy it in production, you can use your DSE to generate predictions or make use of an external tool like Spark, or AWS or Google’s Machine Learning APIs. And as those predictions get pumped back into your database, users can access those predictions right in Looker.

That means anyone in your organization can explore real data and predictions right alongside each other. It means they can visualize them to make sure that the model is continuing to perform well. And they can even schedule data deliveries or alerts from Looker based off of ongoing predictions.

All of this is easy because you’re using best-in-class tools together harmoniously, rather than trying to use one tool for everything.

To show you how easy and effective this workflow can be, look out for our follow-up post walking through a predictive model that we’ve built using Looker, Python, and Jupyter on some neat bikeshare data.

↧

Data Science with Looker: Part II

April 13, 2017, 11:43 pm

≫ Next: Invest in your Most Valuable Asset with HR analytics

≪ Previous: Data Science with Looker: Part I

Advanced Analytics Part II

Data Science with Looker and Python: Part II

To show how seamlessly Looker can integrate into a data science workflow, we took a public dataset (Seattle bikeshare data) and applied a predictive model using Looker, Python, and Jupyter Notebooks.

Follow along as I walk through the setup.

We loaded nearly 2 years (October 2014 - August 2016) of historical daily trip data for Seattle’s bikeshare program into BigQuery. Because we know that the “rebalancing problem” is a key cost driver for bike sharing programs, we thought we’d explore the impact that weather has on trip volume. To do that we’ve imported daily weather data (e.g. temperature, humidity, rain) for Seattle alongside the trip data in BigQuery. Based on our model, we’d like to predict future trip counts by station, but more importantly operationalize those insights to automatically facilitate rebalancing bikes to underserved stations.

How might the weather affect people’s willingness to ride?

First, we want to define relationships between different data points within the bike share data (trip information, station facts, and weather forecasts) in Looker in order to easily start exploring this data. LookML lets us define these relationships once and then explore freely.

We can then look for a relationship between daily trip count and temperature by selecting three fields — Trip Start Date, Temperature, and Trip Count — in the Looker explore panel. Looker automatically writes SQL, retrieves data, and allows us to visualize the results to see if there could be a correlation between temperature and trip count.

Advanced Analytics Diagram 1

A quick examination of the scatterplot indicates a relationship between temperature and ride volume, so let’s build a regression model that predicts trip count as a function of temperature.

Any dataset that we build in Looker can not only be explored in Looker’s web interface, but also in data science environments as well. With a single API call (by passing in a Query ID, Look ID, or the Query Slug), we can access the generated SQL or pull in the result set itself. We then can run a Python script that creates a simple regression model predicting Trip Count as a function of the daily Temperature.

The actual regression model is straightforward and can be written in python by leveraging a Stats package —

Y = Average Trip Count
X = Temperature

model = smf.ols(formula='Y ~ X', data=data).fit()
print(model.summary())

Data Analysis pre Looker

Based on our model we can determine that —

A 1 degree increase in temperature leads to nearly 12 (11.977) additional trips
Temperature has a significant effect on trip count (P > | t| is indistinguishable from 0)
The R² of our basic model is .62 (meaning the temperature can explain 62% of the variability in daily trip count)

Our model predicting trip count based on just temperature is certainly decent, but can we do better? Let’s try to improve on our model by adding other variables. Intuitively, it seems that the trip count will be affected not just by temperature, but also by the humidity level (or the likelihood that it rained on a particular day or the likelihood that there was snow).

To include another factor like humidity in our model, we simply jump back to Looker to further explore the data. We can add humidity as an additional field and pass in the new dataset to our model by grabbing the updated ID.

Explore the Data

Data Analysis

The model now isolates the effect of both temperature and humidity.

For every 1 degree increase in temperature, trip count increases by 7.74 rides.
For every 1 percent increase in humidity, trip count decreases by 4.62 rides
Both variables have a significant impact on trip count.
Our R² has increased to .686

We can continue to go back and bring in additional variables from Looker into our model by simply adding additional fields to the explore page, and making an API call. But for now let’s use temperature and humidity as our two factors or dependent variables that could affect trip count.

For more complex models, we may choose to calculate our predictions in our DSE before piping them back into the database, but for this simple example, we can simply push our model coefficients into our database and use Looker to operationalize these calculations and metrics.

explore: trip {
  join: trip_count_prediction {
    type: cross
    relationship: many_to_one
  }

By joining in trip count predictions (which contains coefficients and intercepts for our linear regression model) in Looker, we can create fields to predict the Average Trip Count based on weather factors. The Trip count calculation, based on the multiple regression model, should look something like the following.

y=β0+β1×Temperature+β2×Humidity+β3×OtherFactorsy=β0+β1×Temperature+β2×Humidity+β3×OtherFactors

  measure: trip_count_prediction{
    type: average
    sql:  (${trip_count_prediction.x0} * ${weather.temperature}) + 
          (${trip_count_prediction.x1} * ${weather.humidity}) + 
          ${trip_count_prediction.intercept};;
    value_format_name: decimal_1
    view_label: "Trip Time Prediction"
  }

We can now can access our predictions right in Looker and can explore real data and predictions right alongside each other. It means that we can also visualize and operationalize this dataset to make sure that the model is continuing to perform well.

Let’s start by comparing our trip counts to those predicted by our model -

Compairing Trips in Looker

Trip Regression

Our predictions seem to be in line with the actual trip counts for those days since a graph of the actual trip count vs the predictive trip count shows a linear relationship. Let’s take this a step further and use our model to forecast average trip times/lengths for future dates based on the same factors we’ve outlined above (the temperature and the humidity).

We can easily pull in weather information for the upcoming week in Seattle (using third-party weather APIs) and use those points to estimate the number of bike share trips. For the following week, as the temperature decreases to the 40s towards the end of the week, our model predicts that around ~250 trips will be taken.

Trips per Date

We can also take into account a degree of error on our weather forecasts for the week. Let’s create additional metrics within Looker that will allow us to define trip counts based on a variance in the temperature.

  filter: weather_variance {
    type: string
  }

  dimension: adjusted_weather {
    type: number
    sql: ${weather.temperature} + CAST({% parameter $weather_variance %} AS FLOAT64) ;;

  }

  measure: trip_count_prediction_what_if {
    type: average
    sql:  (${trip_count_prediction.x0} * ${adjusted_weather}) +
         (${trip_count_prediction.x1} * ${weather.humidity}) +
         ${trip_count_prediction.intercept};;
    value_format_name: decimal_1
    view_label: "Trip Count Prediction"
  }

Trip count upper and lower bound

Taking the results of our model, we could forecast revenue for the upcoming week based on the number of trip counts, or predict average trip length, and a number of other key business metrics to help drive business decisions.

How might Weather factors affect the bike overflow rate at a given station?

Taking our trip count predictive model one step further, we can predict the overutilization rate (number of bikes taken from a station - number of bikes docked at a station) at certain stations on any given day based on weather conditions like Temperature.

We can pull in data from a particular start station (by temperature) and an end station (by temperature) by exploring data with those fields in Looker.

Then, we build a Linear regression model that predicts the number of trips from a start station and an end station, this time grouped by the Station ID. Our new regression model will have coefficients by Station

Linear Regression Output

Now that we have our predictive model, let’s pull it up in Looker and calculate how many Trips were taken per station.

Join in station prediction table in Looker and create fields to predict trips taken based on Temperature.

explore: trip {
  join: station_prediction  {
    type:  left_outer
    relationship: one_to_one
    sql_on:  ${trip.from_station_id} = ${station_prediction.bike_station} ;;
  }

Our predictive measure in Looker, follows the linear regression model that we used and pulls in the coefficients and intercept that have been persisted back into the DB.

Linear Regression Model:

Linear Regression Equation

In our case:

Linear Regression Equation Weather

  measure:  start_predictions {
    type:  average
    sql: (${start_slope} * ${weather.temperature} ) +  ${start_intercept};;
    value_format_name: decimal_1
  }

  measure:  end_predictions {
    type:  average
    sql: (${end_slope} * ${weather.temperature} ) +  ${end_intercept};;
    value_format_name: decimal_1
  }

  measure:  predicted_station_overflow {
    type: number
    sql:  ${end_predictions} - ${start_predictions} ;;
    value_format_name: decimal_1
  }

We can now use Looker to visualize and operationalize this data for future dates. The dimensions and measures we have listed above pull in the weather forecasts for the upcoming week and look like the following:

Rides Data Certain Stations have a high overflow rate (more bikes are returned to that station than are taken out from that station). For example, the 2nd Avenue and Pine St Station is in high demand and will tend to have more bikes docked at that station on this particular day.

Next, we can visualize this information to see if we can spot trends. Because we have our data in Looker, we can effortlessly pull in additional fields about the station like the latitude and longitude of the station and plot the results on a map.

Rides Map

It looks like there is heavy overutilization in Downtown Seattle and by the University district and less utilization around residential areas like Capitol Hill. More people could simply be biking into work and taking a different mode of transportation back home.

What does this mean for the Seattle bikeshare program? Looker could easily send out detailed alerts and scheduled emails to employees every morning with details on predicted utilization patterns. These predictions of bike overflow rates for stations across Seattle, could help guide bike rebalancing.

Pretty cool right?

Looker has made it very easy to operationalize this entire workflow by providing easy access to clean data, providing a framework to easily visualize predictive results, and by making the resulting analyses easy to share across the organization. To learn more about the benefits of using Looker in conjunction with your DSE read our blog post and reach out to a looker analyst for a demo.

↧

Invest in your Most Valuable Asset with HR analytics

April 17, 2017, 9:00 pm

≫ Next: Elastic Analytic Databases

≪ Previous: Data Science with Looker: Part II

HR analytics

In this competitive job market, employers are finding it increasingly difficult to recruit, retain, and motivate top talent. Yet, HR continues to rely on outdated employee engagement strategies that no longer work. As pressures increase from an evolving workforce, companies from startups to enterprises must establish HR & People Analytics now more than ever.

Organizations struggle to understand their complex existing and potential workforce and how to use each effectively. Which applicants should they recruit? Which of their hires do they wish to retain for their performance and productivity? Who amongst their internal talent do they wish to groom for career advancement? What are the most effective compensation, benefits and development options that will optimize the organization’s competitiveness in the marketplace? As we navigate today’s dynamic economy, do we need to retrench again or pursue growth? We want answers to these questions almost on a daily basis, in addition to the most obvious one: what is our headcount?

Built by industry experts on a solid foundation of years of experience in Marketing and Sales analytics across many industries, Datasticians’ HR & People Analytics Looker Block is built to address the above mentioned reporting needs. Since it is operating on the Looker platform, with all business logic residing in a centralized modeling layer, there’s never a need for manual reconciliation, and all analytics are surfaced in real-time.

The Block for HR & People Analytics works with data from Recruitment systems (JobVite, Kenexa, Taleo, Success Factors, etc.), HR data (Workday), and Professional Networking Social platforms, and ties them together into a single central warehouse (Redshift, Snowflake, RDBMS, Google BigQuery, etc.) to allow users to gain insight from a blended data spectrum.

Users can extend the Block to integrate compliance related data to monitor related metrics. HR no longer needs to run individual reports, but now from a single dashboard launch pad, can monitor US EEO, AAP, Vets100, and other compliance reporting to view the state of compliance to determine if action is necessary.

Dashboards include but are not limited to:

Overview
Workforce Effectiveness
Recruitment
Employee Performance

Current Customers should reach out to their Looker Analyst.

Not a customer? Give Looker a free trial today.

↧

Elastic Analytic Databases

April 21, 2017, 12:10 am

≫ Next: Culture Comes from the Inside Out, not the Outside In

≪ Previous: Invest in your Most Valuable Asset with HR analytics

Elastic Analytic Databases

I used to buy VHS tapes. It was great. I could buy my favorite movies and watch them whenever I wanted. Except most of the time I owned a movie, I wasn’t actually watching it. And when DVDs came out, I had to replace my whole collection to upgrade.

Then video rentals came along and I mostly stopped buying movies. Why pay to keep a movie on the shelf when I could access a vastly larger pool of movies whenever I wanted? Except, of course I was limited to one movie at a time, and sometimes the movie I wanted wasn’t available.

That’s why Netflix’s streaming model was such a revelation. Suddenly, I had access to an enormous catalog of movies in an instant and they were never out of stock. And if I wanted to binge watch, I could, without having to drive to the store to trade discs.

Movies transformed from a physical thing I bought, to a shared service I leased a fixed part of, to a utility that scales seamlessly to meet my current needs.

Analytic databases are following the same path.

At first, your only option was to buy a database of a given size, with given performance limitations. At certain times it got used a lot (closing the quarter or doing some big transformation). On other days it might not get used much at all. But not using it didn’t save you any money, because you’d paid up front. And you were stuck with the featureset and power you initially bought.

In the last five years, analytic databases moved to the cloud. That was wonderful, because you no longer had to worry about maintaining them and you could scale them up in a matter of hours. But you still were leasing a fixed part of the cloud and you still paid even when you weren’t using that part.

But now we have offerings like Google BigQuery, Snowflake, and AWS Redshift with Spectrum. They’re Netflix. They scale instantly in response to your current needs.

Elastic Analytical Databases

We call this new breed of warehouses “Elastic Analytical Databases”. These databases can scale up and down as needed depending on your workload. They share three key features that contribute to their elasticity:

Elastic Storage and Compute are Separate
Elastic databases are built on cloud storage systems like Google Cloud Storage or Amazon S3, which means they’re built to handle exabytes of data. This also means that you never need to worry about running out of storage. Because storage is relatively cheap (as compared to compute power), this makes it attractive to store all your data where the elastic database can access it.

Elastic Computing Power Scales (Almost) Infinitely
Elastic Databases have access to vast computing resources, so they can pull as many machines as are necessary to execute your workload. This all happens in a single database architecture—every query runs against the entire cluster. This means that elastic databases can return query results in a relatively consistent amount of time, no matter how large or complex the query is. Some architectures allow you to specify the size of cluster for a given workload, so you can trade speed and size, but just for a long as you need it to complete your workload.

Usage-based Model, Rather than Ownership-based Model
Most of these databases use permissioning, rather than physical separation, to keep different customers’ data separate. This allows providers to smooth demand by sharing power across many customers. So instead of paying to own your own instance (whether you’re currently using it or not), you pay to get access and use the instance for the time you need it.

Unlimited Storage

Cloud storage that has the capacity to scale seamlessly no matter how much data you put in it is a key component of elastic databases. Whereas disk usage is a long-time headache for database admins, Amazon S3 and Google Cloud Storage ensure that you never hit capacity thresholds.

Not having to worry about running out of space is freeing. Just like managing Netflix’s infrastructure isn’t your responsibility, unlimited storage means you no longer need to focus on manually scaling up your cluster size with increased data volume. It lets you focus on doing good analysis, rather than capacity planning.

Unlimited Computability

Elastic database are built to handle many workloads at once, so they have to scale out. Scaling up won’t work. That makes them very powerful. The same way that Google returns results to any question in seconds by hitting hundreds of machines in parallel, queries against an elastic database are similarly performant.

At the 2016 Google Cloud conference, I watched as a Googler ran a query against a petabyte of data. That’s the equivalent of ~110 consecutive years worth of movies. Querying that almost unthinkable amount of data took about two minutes. Similarly, a 4TB query run against a table with 100 billion rows stored in BigQuery returns in under 30 seconds.

The number of individual machines and amount of network bandwidth needed to run a similar query against a non-elastic database is massive, and this doesn’t even include the technical skill and resources needed to manage such a system. And when you’re managing those resources yourself, annoying things like failing servers are your problem. With an elastic database, Google or Amazon or Snowflake handles those failures gracefully in the background.

With an elastic database, the resources needed to run an enormous query are trivial because the system designed to run these databases is much, much larger. These database providers can loan you these vast resources for seconds at a time because they’re only a fraction of a fraction of the entire collection of machines that make up an elastic cluster.

The result is that your queries are never slow, and the cluster never goes down.

A New Usage Model

Unlike other analytical databases, you share only one instance of BigQuery or Athena with thousands of other customers. You dump your data into BigQuery or Athena, both of which are managed for you, and just start querying. No more spinning up or sizing your own instance. With Snowflake, you simply ask for as many machines as you need to perform your function, but only for as much time as you need it, essentially forming an on-demand army of machines that you can use for only the time you need.

With Netflix, you’re not paying for the movie itself. Rather, you’re paying to be a part of the Netflix network and use their services. With elastic databases such as BigQuery, Snowflake, and Athena, you’re not paying to own the database instance itself. Instead, you’re paying to be a part of the instance, and leverage their massive compute power for seconds at a time.

Looking to the Future: Sharing Instead of Moving

Today, if you have data in Salesforce, Zendesk or any other SaaS tool, you have to use their API to move that tool’s data into a database in order to query it. Some APIs are better than others, but by the time the data arrives it is out of date. How out of date the data is depends on the mechanism you are using but that delay can range from minutes to days.

As elastic databases take hold, I believe more and more of these SaaS tools will move to make their data available directly where these new databases can access them. Google’s recent introduction of BigQuery Data Transfer Service does exactly this. Just like sharing a Dropbox or Google Drive folder instead of downloading files to a USB drive, we’re moving to a world where data will be shared, rather than moved.

The data will therefore always be fresh and up to date. This instantaneous sharing stays constant no matter how big the data gets because it’s never moving.

Will Elastic Analytical Databases Win?

We still don’t know how the data community will adapt to the new paradigm of elasticity. Will the benefits be enough to shift the data community towards these elastic databases? Or will unforeseen obstacles prevent them from taking hold?

At Looker, our unique architecture lets our customers immediately take advantage of these new developments in database technology to see if they’re right for them. It may be worth learning more about elastic databases to see if they’re the right fit for your organization; we think they’re pretty amazing.

Want to learn more about how Looker works with elastic databases? Read our full BigQuery Whitepaper, our page on Looker + AWS, and this page about our partnership with Snowflake.

↧

Culture Comes from the Inside Out, not the Outside In

April 28, 2017, 12:52 am

≫ Next: 5 Tips to Becoming a Data-Driven Marketer

≪ Previous: Elastic Analytic Databases

perks

Looker was recently announced as the #3 mid-sized company and #1 mid-sized tech company on the Silicon Valley Business Journal’s 2017 Best Places to Work list. When people ask me about what makes Looker so special, they often want to know about our perks.

“Do you provide free food? Do you offer unlimited Paid Time Off? Do people go surfing during the day? Do you provide whiskey on tap?” The answers are “yes; yes; if the swell is just right; and no.” But if you are trying to understand what makes Looker so special, those are the wrong questions to ask. Those are but outward manifestations of much deeper beliefs that permeate Looker and are core to who we are as a company.

Take unlimited paid time off, for example. On the surface that might be interpreted as a somewhat lax approach to performance or at least a lower level of intensity. Some might attribute that to our headquarters being in Santa Cruz or to something you have to offer in order to compete for talent. Dig into that one a bit and you’ll see that it’s really about autonomy. Lookers work hard. We are on a four-week release cycle and that quick pace permeates the company. I’ve never seen people work harder to make sure the customers who are getting that release are ready and supported. Because we know Lookers will do whatever it takes to meet the needs of those customers, we trust them to know what they need to do to meet their own needs — to de-stress, re-charge, and be at their best. We call it “making time to shred.”

That commitment to the team is born out of commitment to the mission, commitment to the customer, and commitment to the product. I think the various clubs we sponsor facilitate that commitment. The clubs allow us to connect with each other at a deeper level. When we see a fellow Engineer attack a particularly tough trail and not get tossed off their bike, we have newfound respect for them. When we see an Account Executive struggling to get up on a surfboard for the first time, we are inclined to root for them. Those feelings of respect and support translate into the work environment. The trust built during these activities carries over into the office. The next time someone with whom you’ve spent two hours playing board games challenges your ideas, you trust that they are doing just that — challenging your ideas, not challenging you.

If you look at our perks at a deeper level you will see they are manifestations of the environment Lloyd Tabb and Ben Porterfield, our Founders, have been creating from Day One — an environment of emotional safety and helpfulness; autonomy and empowerment. And I have never seen a group of employees more protective of their culture and more committed to making sure that culture scales as we grow than I’ve seen at Looker.

My first week at Looker I had bruises on my thighs because I kept literally pinching myself to see if I was dreaming. The commitment our employees have to our customers and to each other is magic, and it’s our collective job to make sure we do everything we can to foster and nourish that — from weekly Happy Hours to providing a home-cooked, healthy lunch for the company to sit down and enjoy together twice a week. Our spending time with each other and truly getting to know one another is the fertile soil that allows our great product and fanatical commitment to our customers to truly flourish.

The next time you interview for a job, ask someone if they know why their company offers the perks they offer. See if they can articulate a deeper set of beliefs behind those perks. If you are interviewing at Looker, I’m confident in the answers you will hear.

↧

5 Tips to Becoming a Data-Driven Marketer

April 29, 2017, 1:04 am

≫ Next: Using New York City Taxi Data to Avoid Airport Traffic

≪ Previous: Culture Comes from the Inside Out, not the Outside In

perks

Most people can pull up a pretty visualization or create a wonderful presentation with numbers. But as a marketer, and also someone who has taken advanced statistics courses, numbers are meaningless unless you trust and understand the metrics behind them.

We live in a world where every application shows you metrics - in our marketing stack I could pull a report in Marketo, Salesforce, Google Analytics, Optimizely and probably ten more tools that contribute to our department’s success.

The problem with using these tools individually is that they only give you a piece of the puzzle and not the whole picture. And odds are, I wouldn’t even get the same numbers from all of those tools.

As a program manager, I work with leadership to figure out how my piece of the marketing puzzle contributes and drives the top level metrics that affect the business. The key metrics for both marketing and the top level are defined through several dashboards in Looker. Once I have those reports and data dashboards, I backtrack and think of my ability to take action from a tactical point of view. What parts of my program can I optimize to make marketing, and more specifically, demand generation, more successful?

Contributing to the company means exposing your results in a consolidated data analytics platform that brings everyone’s efforts together. Every program manager at Looker is in tune with the company’s overall marketing strategy and KPI’s and using Looker is an essential part in understanding how they can help drive the business forward.

Here are the 5 things I’ve discovered.

Find your technical champion. You need at least one person focused on marketing analytics for either your department or organization — a technical champion. They can work with you to figure out how to consolidate or ETL your data sources and then model them out. This is actually really difficult, but is the ticket to success in having a voice. Investing time and energy upfront with that champion to understand the setup of your metrics (in our case, using our Looker data model) will pay off in spades. This legwork, while time consuming, will allow you to understand exactly how the metrics you rely on are defined. After all, you are the one making decisions with your data.

Get the right data. Figuring out the bigger picture of what data is significant to collect, track, and analyze is extremely important. Work with your managers or technical teams to be certain you are tracking and collecting everything you need.

Define metrics once, and then leverage them endlessly. My time is not spent defending how my efforts fit into the success of the company, but rather making my program and numbers grow. I only had to spend time once with our analytics manager discussing “Is this the right way of giving marketing attribution?”.

Embrace the power of data at your fingertips. Ask me anything about any of our publishers. Anytime. How did one publisher perform in October 2016 in terms of meetings, pipeline and closed business for the sales team? Give me two minutes, give or take for internet speed. How did that group compare to a paid webinar by a different publisher? Now we’re at four minutes. The great thing about having this level of access is that at some point, asking the data becomes the only thing to do. It’s so easy and so fast, that doing anything else feels like a waste of time. If that’s not power at your fingertips, I’m not sure what is.

No more egos. This data-driven culture helps ease tension in the workforce. As a millennial, getting results from trusted data is power. Employees spend time working on their program rather than politicizing for a promotion. A point can be proved with no subjectiveness or arbitrary opinions.

I can’t imagine working for a company that didn’t operate like this. Without a centralized data platform that gave me - a marketer - access to all the data I need - every step of the way, I wouldn’t be able to work as efficiently as I do today. More than that, I wouldn’t be a self-sufficient, data-driven marketer with beautiful dashboards I trust with….well, my job!

↧

Using New York City Taxi Data to Avoid Airport Traffic

May 8, 2017, 9:00 pm

≫ Next: Finance: How we Looker at Looker

≪ Previous: 5 Tips to Becoming a Data-Driven Marketer

exploring taxi data

As Looker’s New York team, we’re constantly using taxis to get around the city. Taxis are a huge part of New York City culture and they play a key role in our team’s productivity. Since data is central to everything we do at Looker, we thought it’d be interesting to see if we can use data to optimize the taxi experience.

New Yorkers rely on taxis, that much is clear to anyone who’s been there and it was immediately clear in the data. The data also supports a hunch that many New Yorkers have; it is nearly impossible to get a cab at 4:30 p.m.. This is due to the unfortunate alignment of a shift change and rush hour.

Manhattan Taxi

Members of our NYC team are constantly heading to the airport to meet with customers or to fly out to our Santa Cruz HQ, so we decided to use this opportunity to gather some data and use it to improve our travel experience. (As opposed to our current ‘system’ of simply relying on the advice of our team’s native-New Yorkers).

By looking at NYC taxi trip data (count, location, and trip length), we formulated some tips for our own taxi-taking strategy.

Catching Taxi

Good News for Morning People

The data confirms that the best time to catch a cab to the airport is 4:00 a.m. At this prime hour, the average cab ride to takes 20 minutes compared to the 41 minutes it takes to travel 3:00 p.m. - the worst time to try to get to the airport.

Since most of us agree that hailing a cab at 4:00 a.m. is ridiculous (unless you’re coming home from an amazing night out), the next best times to get a taxi are between 10:00 a.m. – 12:00 p.m.

We were also surprised to find that traffic from Manhattan towards both Laguardia and JFK is relatively smooth at 7:00 a.m.; We expect this is because early birds are already at their destination for the day. Also, the bulk of the 7am traffic is heading into Manhattan, rather than outbound.

If an early morning trip isn’t possible, we advise you avoid scheduling trips to the airport between 1:00p.m. – 6:00 p.m. Overall, the data states that if you’re heading to the airport, an early or late travel time will serve you better than an afternoon trip.

Image: Length of Taxi Trip (in minutes) by Hour of Day

Best Days to Taxi to the Airport

If you want to optimize your taxi-time to the airport based on day of the week, Monday is clearly your best bet. Want to be extra swift? Nothing beats Monday at 4:00 a.m.

Alternately, Thursday is the worst day catch a cab to the airport with traffic at its peak from 3:00 p.m. clear through to the end of rush hour. Next time you decide to take Friday off for a three-day weekend, you may want to consider leaving on Thursday at lunchtime, or waiting until Friday morning to skip the peak traffic and taxi demand time.

Worst Locations to Catch a Taxi to the Airport

The data confirmed our collective suspicion that Times Square and Rockefeller Center are the absolute worst spots to catch a taxi in NYC. (Yet another reason for New Yorkers to avoid the touristy spots!)

manhattan heatmap Image: Heatmap of pickup locations for taxi trips headed to the airport

While our deep dive into New York City Taxi data validated many of our hunches, it also revealed some great travel tips, less-than-optimum hailing locations and travel times to avoid.

If you’d still like to explore and optimize your own taxi strategy, we’ve the loaded our findings into Looker’s public data with BigQuery, and you can access it here.

If you’d like to learn more about doing cool things with data in Looker, or optimizing your business strategy with data, request a chat with our technical team.

Wishing you safe and data-driven travels,

The Looker Team, NYC

perks

↧

Finance: How we Looker at Looker

May 14, 2017, 2:00 am

≫ Next: Query Exabytes of Data in AWS with Looker’s Native Support of Amazon Redshift Spectrum

≪ Previous: Using New York City Taxi Data to Avoid Airport Traffic

Finance

Looker is growing very quickly. While our growth is dramatic, our ability to scale using Looker may be best demonstrated (in my not-so-humble opinion) by the Looker Finance Department.

When I joined Looker a few years ago, I was the first finance hire outside of our CFO. Today, after several years of 100%+ Yr/Yr growth, two new business entities, and roughly 250 more employees, I have a team of two and we use Looker to help manage and track so much of what we do.

How we Looker

As an FP&A professional, we have several core systems that we leverage to do our jobs and provide the company with key metrics and guidance: a General Ledger & ERP system (Netsuite), a Budgeting or Financial Planning tool (Adaptive Planning), a CRM (Salesforce.com), and often a BI tool to help with reporting and Analytics (Looker).

Outside of those tools is the finance professional’s swiss army knife/crutch: Microsoft Excel. It’s often a “go to” for ad hoc analysis and creating reporting packages for things like Operational reviews and Board of Directors packages. It’s a bold statement, but I’ve spent drastically fewer hours in Excel since coming to Looker than I have in my previous work experience.

At Looker, we’ve been able to centralize the data from all of the above systems in a single database, and then use Looker (the tool) to build our reporting logic into LookML (the modeling layer).

The secret sauce that allows my team to scale our reporting infrastructure is Looker’s reusability. We can build everything once and then reuse that same report framework at the close of every fiscal period.

This has enabled us to drastically cut down the time to value for Monthly/Quarterly KPI’s because our reporting package is pre-built within Looker. Once the accounting team closes the fiscal period, I just press “Run” on our Executive dashboard and I’m instantly presented with updated Revenue, Gross Margin, CAC, MTR, MRR/ARR, and just about anything else I choose to dig into.

Because we can drill in and do our ad hoc analysis directly within Looker, we rarely do any .csv exporting of files and massaging in Excel. Not only does this save time, but it also removes the element of potential human error while still being able to drill into the details of a given report.

NERD ALERT: One of the coolest use cases we’ve built in Looker is our A/R aging report.

We have 900+ customers of various billing structures and payment schedules, all managed by one person on the Accounting team. Looker makes this possible because we can join our Netsuite A/R data with the matching customer record from Salesforce to bring in Account Manager and other information about the account. This data allows us to leverage the sales relationship to collect on any delinquent accounts, and allows us to slice and dice GAAP compliant financials by the infinitely many fields that can be created in Salesforce.

ERP software is often rigid and not meant for the capturing the same type of business specific customer data that Salesforce enables. In the traditional FP&A reporting structure we would have to take the two systems’ Excel extracts and mash them together in Excel via manual effort. With Looker, we mapped the fields together one time and again leverage the reusability of the modeling layer. Although I can’t disclose the exact state of our A/R as a whole, we are very happy with our ability to fund a large portion of our daily business operation through our collections.

As our team continues to support Looker’s incredible growth, I’ve found scaling the business by leveraging Looker to be one of the most rewarding parts of my work. As FP&A professionals, we’re constantly looking for efficiencies and ways to improve or do something better. Looker presents me with those opportunities in my work every day, and I wouldn’t be able to do my job without it.

↧

Query Exabytes of Data in AWS with Looker’s Native Support of Amazon Redshift Spectrum

May 17, 2017, 9:00 pm

≫ Next: JOIN 2017: 5 Things You Don't Want to Miss

≪ Previous: Finance: How we Looker at Looker

AWS Spectrum

At the AWS Summit on Wednesday, April 19th, 2017, Amazon announced a revolutionary new Redshift feature called Spectrum. Spectrum significantly extends the functionality and ease of use of Redshift by allowing users to access exabytes of data stored in S3 without having to load it into Redshift first.

Based on the resounding cheers of the crowd at the Summit and early interest from Looker customers, it’s a feature that people are pretty excited about. Looker customers on Redshift can take advantage of the feature today to maximize the impact of both technologies.

To highlight the speed and power of Spectrum during his Keynote speech, AWS CTO Werner Vogels compared the performance of a complex data warehouse query running across an exabyte of data (approx. 1 billion GB) in Hive on a 1,000 node cluster versus running the same query on Redshift Spectrum. The query would have taken 5 years to complete in Hive, and only 155 seconds with Spectrum at a cost of a few hundred dollars.

An Overview

In addition to the obvious convenience factor of being able to directly access data stored in S3, the ability to query this data directly from Redshift in S3 allows Redshift users to access an exceptional amount of data at an unprecedented rate. To boot, this will lower costs for users while giving them more granular data than ever before. Prior to Spectrum, you were limited to the storage and compute resources that had been dedicated to your Redshift cluster. Now, Spectrum provides federated queries for all of your data stored in S3 and dynamically allocates the necessary hardware based on the requirements of the query, insuring query times stay low, while data volume can continue to grow. Perhaps most importantly, taking advantage of the new Spectrum feature is a seamless experience for end-users; they do not even need to know whether the query they ran is executed against Redshift, S3 or both.

Other benefits include support for open, common data types including CSV/TSV, Parquet, SequenceFile, and RCFile. Files can even be compressed using GZip or Snappy, with other data types and compression methods in the works.

What This Means For Lookers

Spectrum will allow Looker users to dramatically increase the depth and breadth of the data that they are able to analyze. Extremely complex queries can now be run over vast amounts of data at unprecedented scale.

Looker’s native integration, combined with Redshift Spectrum offers the possibility of an infinitely scalable data lake/data warehouse, exciting new modeling opportunities, and expanded insights for businesses. Data stored in Redshift and S3 can be modeled together, then queried simultaneously from Looker. For example, extremely large datasets, or datasets that are subject to extremely complex queries, can be stored in S3 and take advantage of the processing power of Spectrum. The new feature will also allow for hybrid data models, where newer data is stored in Redshift and historical data is stored in S3, or where dimension tables and summarized fact tables are stored in Redshift, while the underlying raw data is stored in S3. A well designed data storage architecture, combined with the power of a Looker model, will allow Looker users to easily traverse between data aggregated from terabytes of raw data stored in the original Redshift cluster, down to the individual events that comprise the aggregations that live in S3.

With this functionality users have access to more data, deeper insights and blazing fast performance from one data platform.

Pricing for data stored in and queried against your existing, relational Redshift cluster will not change. Queries against data that is stored in S3 will be charged on a per-query basis at a cost of $5 per terabyte. Amazon is providing several recommendations on how the data in S3 should be stored that will minimize the per-query costs. These same recommendations also maximize query performance.

Spectrum gives customers unparalleled ability to leverage their data. As companies are collecting more and more data, they need ways to store and process that data quickly and cost effectively and Spectrum is an elegant solution to that problem. Users will always ask to dig deeper, and databases like Redshift with Looker on top allow companies to store, process and expose the sheer quantity of data required to enable that. It’s interesting to watch, and I look forward to seeing how Redshift continues to innovate in this rapidly evolving market.

Read more about Spectrum

↧

JOIN 2017: 5 Things You Don't Want to Miss

May 21, 2017, 2:39 am

≫ Next: Why Nesting Is So Cool

≪ Previous: Query Exabytes of Data in AWS with Looker’s Native Support of Amazon Redshift Spectrum

5 things join

JOIN is back! And this time we’re setting up shop in San Francisco.

The goal for JOIN is simple: Provide an intimate environment where smart data people can meet each other, mingle with our Looker team and walk away feeling like they actually learned something.

To get you excited for JOIN 2017, here are five things we have in the works.

Unfiltered Access to the Looker Team: Have a question for Looker founder, Lloyd Tabb? Curious about our product roadmap? Secretly hoping to meet your chat support team member IRL? Then come to JOIN 2017. Our entire team is ready to roll up their sleeves, drill down into queries, talk through LookML tricks and tips or just hang out and chat data.
Carefully Curated Content: Our content team is busy working on sessions that will help you develop — or advance — your Looker skills. Need help getting started with Looker? Want to explore practical skills? Is there a complex model you need to build? Come to JOIN where there will be a session or workshop designed to help you see what’s possible when you just keep asking.
Speakers — who are just like you: There are a million things you can do with Looker, but often times it’s hard to discover them all. Come to JOIN to learn about the innovative ways people in your industry, including folks from BlueApron, GitHub, PDX and more, are using Looker to make better business decisions that foster a data-driven culture.
Product Updates and Announcement: Remember LookVR? That was announced at last year’s JOIN. This year, our product and engineering teams are heads-down working on some **secret** projects that we can’t wait to share with you. Wish we could say more now, but this is our teaser point to get you excited — and curious — about the future of Looker.
Food, Fun and a Fantastic Party: What’s a conference without good food, plenty of drinks, spirited parties and cool swag? Our events team knows a thing or two about throwing a memorable tech conference. Lucky for you, they know what works. Food will be better than your mom’s best home cooked meal, drinks will be flowing, and we promise you’ve never seen swag like this before. Plus, we’re planning a party. As for the parties… we can’t tell you much more, other than be prepared to sleep in the next day.

There you have it! Five things you won’t want to miss at JOIN 2017. The good news is we have a few more delights in store. So, block your calendar (September 13-15), book a flight (to SFO), and register today.

See you at JOIN!

↧

Why Nesting Is So Cool

May 28, 2017, 3:09 am

≫ Next: Article 0

≪ Previous: JOIN 2017: 5 Things You Don't Want to Miss

Why Nesting is so Cool

When you’re setting up a data warehouse, one of the key questions is how to structure your data. And one of the trickiest types of data to deal with is hierarchically structured data (like orders and the order_items that make them up; or the pageloads that make up sessions).

Do you completely normalize the data into a snowflake schema? Completely denormalize it into a very wide table with lots of repeated values? Or do something in the middle like a star schema?

Historically, that’s a decision you had to make as you were designing your warehouse. And it always involved trade offs. Denormalization (one big wide table with lots of repeated values) takes up more space. But by eliminating joins, it’s faster for some queries. Normalization, on the other hand, avoids repeated values and yields more compact tables. So it’s more space efficient and makes other types of queries faster.

But some of the newest warehousing technologies, like Google BigQuery, offer a third option: don’t decide, nest.

To explain what this means and why it’s so powerful, let’s imagine a (really) simple dataset describing orders and the items that make up each:

Order 1, 2017-01-25: {2 Apples @ $0.50 each, 3 Bananas @ $0.25 each}
Order 2, 2017-01-26: {4 Carrots @ $0.20 each, 1 Date @ $0.60 each, 1 Endive @ $2 each}
Order 3, 2017-01-27: {2 Fennel @ $1 each, 2 Bananas @ $0.25 each}

One way to structure this data would be in a snowflake schema:

Snowflake Schema

This makes counting orders (SELECT COUNT(*) FROM orders)really cheap since it doesn’t need to worry about any of the dimensions of those orders. But getting revenue by day requires two joins (and joins are usually expensive):

SELECT
  order_date,
  SUM(unit_price * quantity)
FROM
  orders
  JOIN order_items USING(order_id)
  JOIN items USING(product_id)
GROUP BY 1
ORDER BY 1

Alternatively, you could completely denormalize the data into one big, wide table:

Orders Denorm

That makes the revenue query dead simple and very cheap:

SELECT
  order_date, 
  SUM(quantity * unit_price) 
FROM 
  orders_denorm 
GROUP BY 1
ORDER BY 1

But now to get a count of orders, you have to do:

SELECT COUNT(DISTINCT order_id) FROM orders_denorm

That scans 7 rows instead of 3; plus DISTINCT is expensive. (And yes, when your table is 7 rows it’s not a big deal, but what if it’s 7 billion?)

So neither is ideal. Really, what you’d like is not to choose your schema until the last moment, when you know what kind of query you’re running and can choose the optimal schema. And that’s what nesting does.

So here’s what the same data looks like nested in BigQuery:

Nested BigQuery

Basically, you’ve nested a whole table into each row of the orders table. Another way to think about it is that your table contains a pre-constructed one-to-many join. The join is there if you need it, at no additional cost, but when you don’t need it, you don’t have to use it.

Now, when you just want to know how many orders there were, you don’t have to scan repeated data or use DISTINCT. You just do:

SELECT COUNT(*) FROM orders

But when you do want to know more about the items in the orders, you simply UNNEST the order_items fields. This is far less computationally expensive than a join, because the data for each row is already colocated. That means the unnesting can be performed in parallel (whereas joining, even when joining on a distribution key, must be performed serially).

The SQL for unnesting can be a bit strange at first, but you quickly get the hang of it. And luckily, Looker can write it for you anyway. Here’s how you’d find out the total revenue for the last three days:

SELECT 
  order_date,
  SUM(order_items.quantity * order_items.unit_price)
FROM 
  orders
  LEFT JOIN UNNEST(orders.order_items) as order_items
GROUP BY 1
ORDER BY 1

(You’ll note that even though there appears to be a LEFT JOIN in the query, since you’re just joining a nested column that’s colocated with the row it’s joining, computationally this join is almost costless.)

To sum up, when you’re dealing with data that is naturally nested, leaving it nested until query time gives you the best of both worlds: great performance no matter what level of the hierarchy you’re interested in.

↧

Article 0

May 30, 2017, 11:48 pm

≫ Next: It’s the End of BI as We Know It

≪ Previous: Why Nesting Is So Cool

↧

It’s the End of BI as We Know It

March 28, 2017, 5:00 pm

≫ Next: Announcing the Looker Block for Audience Analytics by Permutive

≪ Previous: Article 0

Finally Google App Data at Your Fingertips

Some history

What happens when the hunger for data is sated??

This isn’t just a theory. It’s already happening.

BI as a destination is dying

But when the data lives in one place, users can embrace their curiosity. Data is where the question starts, gets answered, sparks another question, and grows.

The transition can be a little scary at first, but ask the companies that are winning how it’s working out for them. They’ll tell you they feel fine.

Learn more about the Looker data platform in a product demo or tune into a webinar I am hosting on April 13, Winning with the 3rd Wave of BI, to hear more about the future of BI.

↧

Announcing the Looker Block for Audience Analytics by Permutive

April 3, 2017, 5:00 pm

≫ Next: Yes! We Have No Connectors

≪ Previous: It’s the End of BI as We Know It

Finally Google App Data at Your Fingertips

Audience Analytics by Permutive Block - Instantly Compare Your Audiences and Analyse Behaviours

Instant, Customisable, Up-to-Date Audience Development

All publishers need to understand their audience segments. To optimise ROI, build loyalty, for a post-campaign report, there are endless advantages to understanding your audience.

Permutive’s Audience Analytics Block is the ideal tool for these insights. It’s a set of tools for you to quickly investigate the audiences you’ve created in Permutive.

By default, this Block contains 3 dashboards.

The Audience Comparison dashboard is a fantastic tool for quickly comparing any audience's side-by-side. It includes engagement metrics and can be customised to include metrics on any data you have in Permutive. Use this dashboard to compare audience segments’ pageviews per session, video completion, gender, page yield… any data you’re collecting in Permutive can be compared, by audience, using this dashboard!

Session Length

Pageviews

Segments Overlap

The Audience Growth dashboard is a high-level summary of your audience segments' total size. Use this tool to understand the total uniques who have fallen into each audience.

Time Series

Cumulative Segment Growth

In summary, this block gives you quick, up-to-date access to audience insights. You can customise the block by adding new metrics and comparing new audiences you’ve created.

If you’re interested in using Permutive we’d love to chat to you, please get in touch by emailing team@permutive.com, or visit our website at www.permutive.com

↧

Yes! We Have No Connectors

April 5, 2017, 5:00 pm

≫ Next: Data Science with Looker: Part I

≪ Previous: Announcing the Looker Block for Audience Analytics by Permutive

Yes We have no connectors

And all of those apps produce data. Data that’s critical to running your business.

Naturally, lots of customers come looking for an analytics tool that can plug into all their SaaS apps, suck the data out, centralize it, magically make sense of it, and let them find insights.

I’m going to be straight with you: Looker is not that tool.

Other tools won’t say that. They’ll say, “Yes! That’s us!”

It isn’t.

I know, because that tool doesn’t exist. And I’d strongly advise caution before buying what they’re selling.

Opportunity Costs

The first reason you should be wary of anyone promising that magical solution is that keeping up to date with dozens of other vendors’ API specs and customers’ unique needs is hard.

Two Flawed Approaches

The second reason to be cautious is that built-in connectors work one of two ways: either they give you a black box or they give you a data dump.

Black Boxness of Connectors

For example, here’s a “basic” entity-relationship diagram for Salesforce’s data tables (and this is just the sales objects):

Salesforce Data Tables

It’s not simple. It’s not easy to understand. And building out those relationships in your analytic tool is difficult, time-consuming, and error-prone.

Our Approach

Looking Ahead

(If you’ve made it to the end, you deserve a song: Yes! We Have No Bananas)

↧

Data Science with Looker: Part I

April 10, 2017, 5:00 pm

≫ Next: Data Science with Looker: Part II

≪ Previous: Yes! We Have No Connectors

Yes We have no connectors

Data Science with Looker and Python

A typical workflow may look like this:

Advanced Analytics Diagram 1

Before a data scientist can start analyzing data in a DSE, they first extract the relevant data from their company’s warehouse. They then spend a considerable amount of time preparing the data, including merging, reshaping, and cleaning this data in their DSE. Once the data has been prepped, advanced functions -- whether they are predictive models or optimizations -- can finally be written in Python or R. But most advanced analytics workflows are iterative. Models are constantly updated to include additional variables. This means repeating steps 1 and 2 (extraction and data prep) multiple times in order to arrive at an optimal dataset.

Advanced Analytics Diagram 2

What’s different about this approach?

All of this is easy because you’re using best-in-class tools together harmoniously, rather than trying to use one tool for everything.

↧

Data Science with Looker: Part II

April 12, 2017, 5:00 pm

≫ Next: Invest in your Most Valuable Asset with HR analytics

≪ Previous: Data Science with Looker: Part I

Advanced Analytics Part II

Data Science with Looker and Python: Part II

To show how seamlessly Looker can integrate into a data science workflow, we took a public dataset (Seattle bikeshare data) and applied a predictive model using Looker, Python, and Jupyter Notebooks.

Follow along as I walk through the setup.

How might the weather affect people’s willingness to ride?

We can then look for a relationship between daily trip count and temperature by selecting three fields -- Trip Start Date, Temperature, and Trip Count -- in the Looker explore panel. Looker automatically writes SQL, retrieves data, and allows us to visualize the results to see if there could be a correlation between temperature and trip count.

Advanced Analytics Diagram 1

A quick examination of the scatterplot indicates a relationship between temperature and ride volume, so let’s build a regression model that predicts trip count as a function of temperature.

The actual regression model is straightforward and can be written in python by leveraging a Stats package --

Y = Average Trip Count
X = Temperature

model = smf.ols(formula='Y ~ X', data=data).fit()
print(model.summary())

Data Analysis pre Looker

Based on our model we can determine that --

A 1 degree increase in temperature leads to nearly 12 (11.977) additional trips
Temperature has a significant effect on trip count (P > | t| is indistinguishable from 0)
The R² of our basic model is .62 (meaning the temperature can explain 62% of the variability in daily trip count)

Explore the Data

Data Analysis

The model now isolates the effect of both temperature and humidity.

For every 1 degree increase in temperature, trip count increases by 7.74 rides.
For every 1 percent increase in humidity, trip count decreases by 4.62 rides
Both variables have a significant impact on trip count.
Our R² has increased to .686

We can continue to go back and bring in additional variables from Looker into our model by simply adding additional fields to the explore page, and making an API call. But for now let's use temperature and humidity as our two factors or dependent variables that could affect trip count.

explore: trip {
  join: trip_count_prediction {
    type: cross
    relationship: many_to_one
  }

y=β0+β1×Temperature+β2×Humidity+β3×OtherFactorsy=β0+β1×Temperature+β2×Humidity+β3×OtherFactors

  measure: trip_count_prediction{
    type: average
    sql:  (${trip_count_prediction.x0} * ${weather.temperature}) + 
          (${trip_count_prediction.x1} * ${weather.humidity}) + 
          ${trip_count_prediction.intercept};;
    value_format_name: decimal_1
    view_label: "Trip Time Prediction"
  }

Let’s start by comparing our trip counts to those predicted by our model -

Compairing Trips in Looker

Trip Regression

Our predictions seem to be in line with the actual trip counts for those days since a graph of the actual trip count vs the predictive trip count shows a linear relationship. Let’s take this a step further and use our model to forecast average trip times/lengths for future dates based on the same factors we've outlined above (the temperature and the humidity).

Trips per Date

  filter: weather_variance {
    type: string
  }

  dimension: adjusted_weather {
    type: number
    sql: ${weather.temperature} + CAST({% parameter $weather_variance %} AS FLOAT64) ;;

  }

  measure: trip_count_prediction_what_if {
    type: average
    sql:  (${trip_count_prediction.x0} * ${adjusted_weather}) +
         (${trip_count_prediction.x1} * ${weather.humidity}) +
         ${trip_count_prediction.intercept};;
    value_format_name: decimal_1
    view_label: "Trip Count Prediction"
  }

Trip count upper and lower bound

How might Weather factors affect the bike overflow rate at a given station?

We can pull in data from a particular start station (by temperature) and an end station (by temperature) by exploring data with those fields in Looker.

Linear Regression Output

Now that we have our predictive model, let's pull it up in Looker and calculate how many Trips were taken per station.

Join in station prediction table in Looker and create fields to predict trips taken based on Temperature.

explore: trip {
  join: station_prediction  {
    type:  left_outer
    relationship: one_to_one
    sql_on:  ${trip.from_station_id} = ${station_prediction.bike_station} ;;
  }

Our predictive measure in Looker, follows the linear regression model that we used and pulls in the coefficients and intercept that have been persisted back into the DB.

Linear Regression Model:

Linear Regression Equation

In our case:

Linear Regression Equation Weather

  measure:  start_predictions {
    type:  average
    sql: (${start_slope} * ${weather.temperature} ) +  ${start_intercept};;
    value_format_name: decimal_1
  }

  measure:  end_predictions {
    type:  average
    sql: (${end_slope} * ${weather.temperature} ) +  ${end_intercept};;
    value_format_name: decimal_1
  }

  measure:  predicted_station_overflow {
    type: number
    sql:  ${end_predictions} - ${start_predictions} ;;
    value_format_name: decimal_1
  }

Rides Map

Pretty cool right?

↧