
Invest in your Most Valuable Asset with HR analytics


In this competitive job market, employers are finding it increasingly difficult to recruit, retain, and motivate top talent. Yet HR continues to rely on outdated employee engagement strategies that no longer work. As pressure mounts from an evolving workforce, companies from startups to enterprises need HR & People Analytics now more than ever.

Organizations struggle to understand their existing and potential workforce and how to deploy each effectively. Which applicants should they recruit? Which of their hires do they wish to retain for their performance and productivity? Who amongst their internal talent do they wish to groom for career advancement? What are the most effective compensation, benefits, and development options that will optimize the organization’s competitiveness in the marketplace? As we navigate today’s dynamic economy, do we need to retrench again or pursue growth? We want answers to these questions almost daily, in addition to the most obvious one: what is our headcount?

Built by industry experts with years of experience in Marketing and Sales analytics across many industries, Datasticians’ HR & People Analytics Looker Block addresses these reporting needs. Because it runs on the Looker platform, with all business logic residing in a centralized modeling layer, there’s never a need for manual reconciliation, and all analytics are surfaced in real time.

The Block for HR & People Analytics works with data from recruitment systems (JobVite, Kenexa, Taleo, Success Factors, etc.), HR systems (Workday), and professional networking platforms, and ties them together into a single central warehouse (Redshift, Snowflake, RDBMS, Google BigQuery, etc.) so users can gain insight from a blended view of all their data.

Users can extend the Block to integrate compliance-related data and monitor the associated metrics. HR no longer needs to run individual reports; from a single dashboard launch pad, teams can monitor US EEO, AAP, Vets100, and other compliance reporting to view the state of compliance and determine whether action is necessary.

Dashboards include but are not limited to:

  • Overview
  • Workforce Effectiveness
  • Recruitment
  • Employee Performance

Current Customers should reach out to their Looker Analyst.

Not a customer? Give Looker a free trial today.


Elastic Analytic Databases


I used to buy VHS tapes. It was great. I could buy my favorite movies and watch them whenever I wanted. Except most of the time I owned a movie, I wasn’t actually watching it. And when DVDs came out, I had to replace my whole collection to upgrade.

Then video rentals came along and I mostly stopped buying movies. Why pay to keep a movie on the shelf when I could access a vastly larger pool of movies whenever I wanted? Except, of course I was limited to one movie at a time, and sometimes the movie I wanted wasn’t available.

That’s why Netflix’s streaming model was such a revelation. Suddenly, I had access to an enormous catalog of movies in an instant and they were never out of stock. And if I wanted to binge watch, I could, without having to drive to the store to trade discs.

Movies transformed from a physical thing I bought, to a shared service I leased a fixed part of, to a utility that scales seamlessly to meet my current needs.

Analytic databases are following the same path.

At first, your only option was to buy a database of a given size, with given performance limitations. At certain times it got used a lot (closing the quarter or doing some big transformation). On other days it might not get used much at all. But not using it didn’t save you any money, because you’d paid up front. And you were stuck with the feature set and power you initially bought.

In the last five years, analytic databases moved to the cloud. That was wonderful, because you no longer had to worry about maintaining them and you could scale them up in a matter of hours. But you still were leasing a fixed part of the cloud and you still paid even when you weren’t using that part.

But now we have offerings like Google BigQuery, Snowflake, and AWS Redshift with Spectrum. They’re Netflix. They scale instantly in response to your current needs.

Elastic Analytical Databases

We call this new breed of warehouses “Elastic Analytical Databases”. These databases can scale up and down as needed depending on your workload. They share three key features that contribute to their elasticity:

Elastic Storage and Compute are Separate
Elastic databases are built on cloud storage systems like Google Cloud Storage or Amazon S3, which are designed to handle exabytes of data. This means you never need to worry about running out of storage. And because storage is relatively cheap (compared to compute power), it becomes attractive to store all your data where the elastic database can access it.

Elastic Computing Power Scales (Almost) Infinitely
Elastic databases have access to vast computing resources, so they can pull in as many machines as are necessary to execute your workload. This all happens in a single database architecture--every query runs against the entire cluster. This means that elastic databases can return query results in a relatively consistent amount of time, no matter how large or complex the query is. Some architectures allow you to specify the size of the cluster for a given workload, so you can trade off speed and size, but only for as long as you need to complete your workload.
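
Snowflake, for example, exposes this trade-off directly in SQL. A minimal sketch (the warehouse name and sizes below are illustrative assumptions, not a prescription): spin up a large virtual warehouse for a heavy job, then shrink it or let it suspend when you're done.

-- Sketch only: warehouse name and sizes are illustrative assumptions.
-- Create a large virtual warehouse for a heavy transformation job,
-- and let it suspend itself when idle so you stop paying for it.
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WITH WAREHOUSE_SIZE = 'XLARGE'
       AUTO_SUSPEND = 300   -- suspend after 5 idle minutes
       AUTO_RESUME = TRUE;

-- When the heavy workload is finished, drop back down to a small size.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XSMALL';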

Usage-based Model, Rather than Ownership-based Model
Most of these databases use permissioning, rather than physical separation, to keep different customers’ data separate. This allows providers to smooth demand by sharing power across many customers. So instead of paying to own your own instance (whether you’re currently using it or not), you pay to get access and use the instance for the time you need it.

Unlimited Storage

Cloud storage that has the capacity to scale seamlessly no matter how much data you put in it is a key component of elastic databases. Whereas disk usage is a long-time headache for database admins, Amazon S3 and Google Cloud Storage ensure that you never hit capacity thresholds.

Not having to worry about running out of space is freeing. Just like managing Netflix’s infrastructure isn’t your responsibility, unlimited storage means you no longer need to focus on manually scaling up your cluster size with increased data volume. It lets you focus on doing good analysis, rather than capacity planning.

Unlimited Computability

Elastic databases are built to handle many workloads at once, so they have to scale out; scaling up won’t work. That makes them very powerful. The same way that Google returns results to any question in seconds by hitting hundreds of machines in parallel, queries against an elastic database are similarly performant.

At the 2016 Google Cloud conference, I watched as a Googler ran a query against a petabyte of data. That’s the equivalent of ~110 consecutive years’ worth of movies. Querying that almost unthinkable amount of data took about two minutes. Similarly, a 4TB query run against a table with 100 billion rows stored in BigQuery returns in under 30 seconds.

The number of individual machines and amount of network bandwidth needed to run a similar query against a non-elastic database is massive, and this doesn’t even include the technical skill and resources needed to manage such a system. And when you’re managing those resources yourself, annoying things like failing servers are your problem. With an elastic database, Google or Amazon or Snowflake handles those failures gracefully in the background.

With an elastic database, the resources needed to run an enormous query are trivial because the system designed to run these databases is much, much larger. These database providers can loan you these vast resources for seconds at a time because they’re only a fraction of a fraction of the entire collection of machines that make up an elastic cluster.

The result is that your queries are never slow, and the cluster never goes down.

A New Usage Model

Unlike other analytical databases, BigQuery and Athena are single shared services: you share one instance with thousands of other customers. You dump your data in, both services are managed for you, and you just start querying. No more spinning up or sizing your own instance. With Snowflake, you ask for as many machines as you need to perform your workload, for only as long as you need them--essentially an on-demand army of machines.
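
To make that usage model concrete, here is a minimal sketch with BigQuery (the public dataset below is real; the query itself is just an illustration): there is no instance to create or size--you point standard SQL at a table and pay for the bytes scanned.

-- Sketch: query a BigQuery public dataset with nothing provisioned.
-- There is no instance to spin up or size; billing is per bytes scanned.
SELECT
  corpus,
  SUM(word_count) AS total_words
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus
ORDER BY total_words DESC
LIMIT 10;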

With Netflix, you’re not paying for the movie itself. Rather, you’re paying to be a part of the Netflix network and use their services. With elastic databases such as BigQuery, Snowflake, and Athena, you’re not paying to own the database instance itself. Instead, you’re paying to be a part of the instance, and leverage their massive compute power for seconds at a time.

Looking to the Future: Sharing Instead of Moving

Today, if you have data in Salesforce, Zendesk or any other SaaS tool, you have to use their API to move that tool’s data into a database in order to query it. Some APIs are better than others, but by the time the data arrives it is out of date. How out of date the data is depends on the mechanism you are using but that delay can range from minutes to days.

As elastic databases take hold, I believe more and more of these SaaS tools will move to make their data available directly where these new databases can access them. Google’s recent introduction of BigQuery Data Transfer Service does exactly this. Just like sharing a Dropbox or Google Drive folder instead of downloading files to a USB drive, we’re moving to a world where data will be shared, rather than moved.

The data will therefore always be fresh and up to date. This instantaneous sharing stays constant no matter how big the data gets because it’s never moving.

Will Elastic Analytical Databases Win?

We still don’t know how the data community will adapt to the new paradigm of elasticity. Will the benefits be enough to shift the data community towards these elastic databases? Or will unforeseen obstacles prevent them from taking hold?

At Looker, our unique architecture lets our customers immediately take advantage of these new developments in database technology to see if they’re right for them. It may be worth learning more about elastic databases to see if they’re the right fit for your organization; we think they’re pretty amazing.

Want to learn more about how Looker works with elastic databases? Read our full BigQuery Whitepaper, our page on Looker + AWS, and this page about our partnership with Snowflake.

Culture Comes from the Inside Out, not the Outside In


Looker was recently announced as the #3 mid-sized company and #1 mid-sized tech company on the Silicon Valley Business Journal’s 2017 Best Places to Work list. When people ask me about what makes Looker so special, they often want to know about our perks.

“Do you provide free food? Do you offer unlimited Paid Time Off? Do people go surfing during the day? Do you provide whiskey on tap?” The answers are “yes; yes; if the swell is just right; and no.” But if you are trying to understand what makes Looker so special, those are the wrong questions to ask. Those are but outward manifestations of much deeper beliefs that permeate Looker and are core to who we are as a company.

Take unlimited paid time off, for example. On the surface that might be interpreted as a somewhat lax approach to performance or at least a lower level of intensity. Some might attribute that to our headquarters being in Santa Cruz or to something you have to offer in order to compete for talent. Dig into that one a bit and you’ll see that it’s really about autonomy. Lookers work hard. We are on a four-week release cycle and that quick pace permeates the company. I’ve never seen people work harder to make sure the customers who are getting that release are ready and supported. Because we know Lookers will do whatever it takes to meet the needs of those customers, we trust them to know what they need to do to meet their own needs -- to de-stress, re-charge, and be at their best. We call it “making time to shred.”

That commitment to the team is born out of commitment to the mission, commitment to the customer, and commitment to the product. I think the various clubs we sponsor facilitate that commitment. The clubs allow us to connect with each other at a deeper level. When we see a fellow Engineer attack a particularly tough trail and not get tossed off their bike, we have newfound respect for them. When we see an Account Executive struggling to get up on a surfboard for the first time, we are inclined to root for them. Those feelings of respect and support translate into the work environment. The trust built during these activities carries over into the office. The next time someone with whom you’ve spent two hours playing board games challenges your ideas, you trust that they are doing just that -- challenging your ideas, not challenging you.

If you look at our perks at a deeper level you will see they are manifestations of the environment Lloyd Tabb and Ben Porterfield, our Founders, have been creating from Day One -- an environment of emotional safety and helpfulness; autonomy and empowerment. And I have never seen a group of employees more protective of their culture and more committed to making sure that culture scales as we grow than I’ve seen at Looker.

My first week at Looker I had bruises on my thighs because I kept literally pinching myself to see if I was dreaming. The commitment our employees have to our customers and to each other is magic, and it’s our collective job to make sure we do everything we can to foster and nourish that -- from weekly Happy Hours to providing a home-cooked, healthy lunch for the company to sit down and enjoy together twice a week. Our spending time with each other and truly getting to know one another is the fertile soil that allows our great product and fanatical commitment to our customers to truly flourish.

The next time you interview for a job, ask someone if they know why their company offers the perks they offer. See if they can articulate a deeper set of beliefs behind those perks. If you are interviewing at Looker, I’m confident in the answers you will hear.

5 Tips to Becoming a Data-Driven Marketer


Most people can pull up a pretty visualization or create a wonderful presentation with numbers. But as a marketer who has also taken advanced statistics courses, I know that numbers are meaningless unless you trust and understand the metrics behind them.

We live in a world where every application shows you metrics - in our marketing stack I could pull a report in Marketo, Salesforce, Google Analytics, Optimizely and probably ten more tools that contribute to our department's success.

The problem with using these tools individually is that they only give you a piece of the puzzle and not the whole picture. And odds are, I wouldn’t even get the same numbers from all of those tools.

As a program manager, I work with leadership to figure out how my piece of the marketing puzzle contributes and drives the top level metrics that affect the business. The key metrics for both marketing and the top level are defined through several dashboards in Looker. Once I have those reports and data dashboards, I backtrack and think of my ability to take action from a tactical point of view. What parts of my program can I optimize to make marketing, and more specifically, demand generation, more successful?

Contributing to the company means exposing your results in a consolidated data analytics platform that brings everyone’s efforts together. Every program manager at Looker is in tune with the company’s overall marketing strategy and KPIs, and using Looker is an essential part of understanding how they can help drive the business forward.

Here are the 5 things I’ve discovered.

Find your technical champion. You need at least one person focused on marketing analytics for either your department or organization — a technical champion. They can work with you to figure out how to consolidate or ETL your data sources and then model them out. This is actually really difficult, but is the ticket to success in having a voice. Investing time and energy upfront with that champion to understand the setup of your metrics (in our case, using our Looker data model) will pay off in spades. This legwork, while time consuming, will allow you to understand exactly how the metrics you rely on are defined. After all, you are the one making decisions with your data.

Get the right data. Figuring out the bigger picture of what data is significant to collect, track, and analyze is extremely important. Work with your managers or technical teams to be certain you are tracking and collecting everything you need.

Define metrics once, and then leverage them endlessly. My time is not spent defending how my efforts fit into the success of the company, but rather making my program and numbers grow. I only had to sit down once with our analytics manager to discuss, “Is this the right way of giving marketing attribution?”

Embrace the power of data at your fingertips. Ask me anything about any of our publishers. Anytime. How did one publisher perform in October 2016 in terms of meetings, pipeline and closed business for the sales team? Give me two minutes, give or take for internet speed. How did that group compare to a paid webinar by a different publisher? Now we’re at four minutes. The great thing about having this level of access is that at some point, asking the data becomes the only thing to do. It’s so easy and so fast, that doing anything else feels like a waste of time. If that's not power at your fingertips, I’m not sure what is.

No more egos. This data-driven culture helps ease tension in the workforce. As a millennial, I find that getting results from trusted data is power. Employees spend time working on their programs rather than politicking for a promotion. A point can be proved without subjectivity or arbitrary opinions.

I can’t imagine working for a company that didn’t operate like this. Without a centralized data platform that gave me - a marketer - access to all the data I need - every step of the way, I wouldn’t be able to work as efficiently as I do today. More than that, I wouldn’t be a self-sufficient, data-driven marketer with beautiful dashboards I trust with….well, my job!

Using New York City Taxi Data to Avoid Airport Traffic


As Looker’s New York team, we’re constantly using taxis to get around the city. Taxis are a huge part of New York City culture and they play a key role in our team’s productivity. Since data is central to everything we do at Looker, we thought it’d be interesting to see if we can use data to optimize the taxi experience.

New Yorkers rely on taxis; that much is clear to anyone who’s been there, and it was immediately clear in the data. The data also supports a hunch that many New Yorkers have: it is nearly impossible to get a cab at 4:30 p.m., thanks to the unfortunate alignment of a shift change and rush hour.


Members of our NYC team are constantly heading to the airport to meet with customers or to fly out to our Santa Cruz HQ, so we decided to use this opportunity to gather some data and use it to improve our travel experience. (As opposed to our current ‘system’ of simply relying on the advice of our team’s native-New Yorkers).

By looking at NYC taxi trip data (count, location, and trip length), we formulated some tips for our own taxi-taking strategy.


Good News for Morning People

The data confirms that the best time to catch a cab to the airport is 4:00 a.m. At this prime hour, the average cab ride to the airport takes 20 minutes, compared to the 41 minutes it takes at 3:00 p.m. - the worst time to try to get to the airport.

Since most of us agree that hailing a cab at 4:00 a.m. is ridiculous (unless you’re coming home from an amazing night out), the next best times to get a taxi are between 10:00 a.m. – 12:00 p.m.

We were also surprised to find that traffic from Manhattan towards both LaGuardia and JFK is relatively smooth at 7:00 a.m.; we expect this is because early birds are already at their destination for the day. Also, the bulk of the 7 a.m. traffic is heading into Manhattan, rather than outbound.

If an early morning trip isn’t possible, we advise you to avoid scheduling trips to the airport between 1:00 p.m. and 6:00 p.m. Overall, the data shows that if you’re heading to the airport, an early or late travel time will serve you better than an afternoon trip.

Image: Length of Taxi Trip (in minutes) by Hour of Day
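
For the SQL-curious, a query along these lines is roughly what sits behind a chart like the one above. It's only a sketch: the table and column names below are assumptions, not the exact schema of the public dataset we loaded.

-- Sketch only: table and column names are assumptions, not the exact
-- public dataset schema. Average airport-bound trip length by pickup hour.
SELECT
  EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour,
  AVG(TIMESTAMP_DIFF(dropoff_datetime, pickup_datetime, MINUTE)) AS avg_trip_minutes
FROM nyc_taxi.trips
WHERE dropoff_zone IN ('JFK Airport', 'LaGuardia Airport')
GROUP BY pickup_hour
ORDER BY pickup_hour;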

Best Days to Taxi to the Airport

If you want to optimize your taxi-time to the airport based on day of the week, Monday is clearly your best bet. Want to be extra swift? Nothing beats Monday at 4:00 a.m.

Alternately, Thursday is the worst day to catch a cab to the airport, with traffic at its peak from 3:00 p.m. clear through to the end of rush hour. Next time you decide to take Friday off for a three-day weekend, you may want to consider leaving on Thursday at lunchtime, or waiting until Friday morning to skip the peak traffic and taxi demand.

Worst Locations to Catch a Taxi to the Airport

The data confirmed our collective suspicion that Times Square and Rockefeller Center are the absolute worst spots to catch a taxi in NYC. (Yet another reason for New Yorkers to avoid the touristy spots!)

Image: Heatmap of pickup locations for taxi trips headed to the airport

While our deep dive into New York City taxi data validated many of our hunches, it also revealed some useful travel tips, less-than-optimal hailing locations, and travel times to avoid.

If you’d still like to explore and optimize your own taxi strategy, we’ve loaded our findings into Looker’s public data with BigQuery, and you can access it here.

If you’d like to learn more about doing cool things with data in Looker, or optimizing your business strategy with data, request a chat with our technical team.

Wishing you safe and data-driven travels,

The Looker Team, NYC


Finance: How we Looker at Looker


Looker is growing very quickly. While our growth is dramatic, our ability to scale using Looker may be best demonstrated (in my not-so-humble opinion) by the Looker Finance Department.

When I joined Looker a few years ago, I was the first finance hire outside of our CFO. Today, after several years of 100%+ Yr/Yr growth, two new business entities, and roughly 250 more employees, I have a team of two and we use Looker to help manage and track so much of what we do.

How we Looker

As FP&A professionals, we have several core systems that we leverage to do our jobs and provide the company with key metrics and guidance: a General Ledger & ERP system (Netsuite), a budgeting or financial planning tool (Adaptive Planning), a CRM (Salesforce.com), and often a BI tool to help with reporting and analytics (Looker).

Outside of those tools is the finance professional’s Swiss Army knife/crutch: Microsoft Excel. It’s often a “go to” for ad hoc analysis and for creating reporting packages for things like operational reviews and Board of Directors packages. It’s a bold statement, but I’ve spent drastically fewer hours in Excel since coming to Looker than I did in my previous roles.

At Looker, we’ve been able to centralize the data from all of the above systems in a single database, and then use Looker (the tool) to build our reporting logic into LookML (the modeling layer).

The secret sauce that allows my team to scale our reporting infrastructure is Looker’s reusability. We can build everything once and then reuse that same report framework at the close of every fiscal period.

This has enabled us to drastically cut down the time to value for monthly/quarterly KPIs because our reporting package is pre-built within Looker. Once the accounting team closes the fiscal period, I just press “Run” on our Executive dashboard and I’m instantly presented with updated Revenue, Gross Margin, CAC, MTR, MRR/ARR, and just about anything else I choose to dig into.

Because we can drill in and do our ad hoc analysis directly within Looker, we rarely do any .csv exporting of files and massaging in Excel. Not only does this save time, but it also removes the element of potential human error while still being able to drill into the details of a given report.

NERD ALERT: One of the coolest use cases we’ve built in Looker is our A/R aging report.

We have 900+ customers with various billing structures and payment schedules, all managed by one person on the Accounting team. Looker makes this possible because we can join our Netsuite A/R data with the matching customer record from Salesforce to bring in the Account Manager and other information about the account. This data allows us to leverage the sales relationship to collect on any delinquent accounts, and lets us slice and dice GAAP-compliant financials by the infinitely many fields that can be created in Salesforce.

ERP software is often rigid and not meant for capturing the same kind of business-specific customer data that Salesforce enables. In the traditional FP&A reporting structure we would have to take the two systems’ Excel extracts and mash them together in Excel via manual effort. With Looker, we mapped the fields together one time and once again leverage the reusability of the modeling layer. Although I can’t disclose the exact state of our A/R as a whole, we are very happy with our ability to fund a large portion of our daily business operations through our collections.
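
To make the idea concrete, here is a rough sketch of that join in plain SQL (BigQuery-flavored syntax; every table and column name below is hypothetical -- our real model lives in LookML and is more involved):

-- Sketch only: hypothetical table and column names illustrating how
-- Netsuite A/R invoices can be joined to the owning Salesforce account.
SELECT
  sf.account_name,
  sf.account_manager,
  ns.invoice_number,
  ns.amount_remaining,
  DATE_DIFF(CURRENT_DATE(), ns.due_date, DAY) AS days_past_due
FROM netsuite.ar_invoices AS ns
JOIN salesforce.accounts AS sf
  ON ns.customer_id = sf.netsuite_customer_id
WHERE ns.amount_remaining > 0
ORDER BY days_past_due DESC;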

As our team continues to support Looker’s incredible growth, I’ve found scaling the business by leveraging Looker to be one of the most rewarding parts of my work. As FP&A professionals, we’re constantly looking for efficiencies and ways to improve or do something better. Looker presents me with those opportunities in my work every day, and I wouldn’t be able to do my job without it.

Query Exabytes of Data in AWS with Looker’s Native Support of Amazon Redshift Spectrum


At the AWS Summit on Wednesday, April 19th, 2017, Amazon announced a revolutionary new Redshift feature called Spectrum. Spectrum significantly extends the functionality and ease of use of Redshift by allowing users to access exabytes of data stored in S3 without having to load it into Redshift first.

Based on the resounding cheers of the crowd at the Summit and early interest from Looker customers, it’s a feature that people are pretty excited about. Looker customers on Redshift can take advantage of the feature today to maximize the impact of both technologies.

To highlight the speed and power of Spectrum during his Keynote speech, AWS CTO Werner Vogels compared the performance of a complex data warehouse query running across an exabyte of data (approx. 1 billion GB) in Hive on a 1,000 node cluster versus running the same query on Redshift Spectrum. The query would have taken 5 years to complete in Hive, and only 155 seconds with Spectrum at a cost of a few hundred dollars.

An Overview

In addition to the obvious convenience factor of being able to directly access data stored in S3, the ability to query that S3 data directly from Redshift gives Redshift users access to an exceptional amount of data at an unprecedented rate. To boot, this will lower costs for users while giving them more granular data than ever before. Prior to Spectrum, you were limited to the storage and compute resources that had been dedicated to your Redshift cluster. Now, Spectrum provides federated queries for all of your data stored in S3 and dynamically allocates the necessary hardware based on the requirements of the query, ensuring query times stay low while data volume continues to grow. Perhaps most importantly, taking advantage of the new Spectrum feature is a seamless experience for end users; they do not even need to know whether the query they ran executed against Redshift, S3, or both.

Other benefits include support for open, common data types including CSV/TSV, Parquet, SequenceFile, and RCFile. Files can even be compressed using GZip or Snappy, with other data types and compression methods in the works.
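
To give a feel for how this looks in practice, here is a rough sketch of the Spectrum setup (the IAM role, bucket path, and columns below are illustrative assumptions): register an external schema backed by a data catalog, define an external table over files in S3, then query it like any other table.

-- Sketch only: the IAM role, bucket path, and columns are illustrative.
-- 1. Register an external schema backed by the data catalog.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- 2. Define an external table over Parquet files sitting in S3.
CREATE EXTERNAL TABLE spectrum.events (
  event_id   BIGINT,
  event_time TIMESTAMP,
  user_id    BIGINT,
  event_type VARCHAR(64)
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/events/';

-- 3. Query it (and join it to local Redshift tables) like any other table.
SELECT event_type, COUNT(*) FROM spectrum.events GROUP BY 1;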

What This Means For Lookers

Spectrum will allow Looker users to dramatically increase the depth and breadth of the data that they are able to analyze. Extremely complex queries can now be run over vast amounts of data at unprecedented scale.

Looker’s native integration, combined with Redshift Spectrum offers the possibility of an infinitely scalable data lake/data warehouse, exciting new modeling opportunities, and expanded insights for businesses. Data stored in Redshift and S3 can be modeled together, then queried simultaneously from Looker. For example, extremely large datasets, or datasets that are subject to extremely complex queries, can be stored in S3 and take advantage of the processing power of Spectrum. The new feature will also allow for hybrid data models, where newer data is stored in Redshift and historical data is stored in S3, or where dimension tables and summarized fact tables are stored in Redshift, while the underlying raw data is stored in S3. A well designed data storage architecture, combined with the power of a Looker model, will allow Looker users to easily traverse between data aggregated from terabytes of raw data stored in the original Redshift cluster, down to the individual events that comprise the aggregations that live in S3.


With this functionality users have access to more data, deeper insights and blazing fast performance from one data platform.


Pricing for data stored in and queried against your existing, relational Redshift cluster will not change. Queries against data that is stored in S3 will be charged on a per-query basis at a cost of $5 per terabyte. Amazon is providing several recommendations on how the data in S3 should be stored that will minimize the per-query costs. These same recommendations also maximize query performance.

Spectrum gives customers unparalleled ability to leverage their data. As companies are collecting more and more data, they need ways to store and process that data quickly and cost effectively and Spectrum is an elegant solution to that problem. Users will always ask to dig deeper, and databases like Redshift with Looker on top allow companies to store, process and expose the sheer quantity of data required to enable that. It’s interesting to watch, and I look forward to seeing how Redshift continues to innovate in this rapidly evolving market.

Read more about Spectrum

JOIN 2017: 5 Things You Don't Want to Miss


JOIN is back! And this time we’re setting up shop in San Francisco.

The goal for JOIN is simple: Provide an intimate environment where smart data people can meet each other, mingle with our Looker team and walk away feeling like they actually learned something.

To get you excited for JOIN 2017, here are five things we have in the works.

  1. Unfiltered Access to the Looker Team: Have a question for Looker founder, Lloyd Tabb? Curious about our product roadmap? Secretly hoping to meet your chat support team member IRL? Then come to JOIN 2017. Our entire team is ready to roll up their sleeves, drill down into queries, talk through LookML tricks and tips or just hang out and chat data.

  2. Carefully Curated Content: Our content team is busy working on sessions that will help you develop -- or advance -- your Looker skills. Need help getting started with Looker? Want to explore practical skills? Is there a complex model you need to build? Come to JOIN where there will be a session or workshop designed to help you see what’s possible when you just keep asking.

  3. Speakers -- who are just like you: There are a million things you can do with Looker, but oftentimes it’s hard to discover them all. Come to JOIN to learn about the innovative ways people in your industry, including folks from Blue Apron, GitHub, PDX and more, are using Looker to make better business decisions that foster a data-driven culture.

  4. Product Updates and Announcements: Remember LookVR? That was announced at last year’s JOIN. This year, our product and engineering teams are heads-down working on some **secret** projects that we can’t wait to share with you. We wish we could say more now, but this is our teaser to get you excited -- and curious -- about the future of Looker.

  5. Food, Fun and a Fantastic Party: What’s a conference without good food, plenty of drinks, spirited parties and cool swag? Our events team knows a thing or two about throwing a memorable tech conference. Lucky for you, they know what works. Food will be better than your mom’s best home-cooked meal, drinks will be flowing, and we promise you’ve never seen swag like this before. Plus, we’re planning a party. As for the party… we can’t tell you much more, other than be prepared to sleep in the next day.

There you have it! Five things you won’t want to miss at JOIN 2017. The good news is we have a few more delights in store. So, block your calendar (September 13-15), book a flight (to SFO), and register today.

See you at JOIN!


Why Nesting Is So Cool


When you're setting up a data warehouse, one of the key questions is how to structure your data. And one of the trickiest types of data to deal with is hierarchically structured data (like orders and the order_items that make them up; or the pageloads that make up sessions).

Do you completely normalize the data into a snowflake schema? Completely denormalize it into a very wide table with lots of repeated values? Or do something in the middle like a star schema?

Historically, that's a decision you had to make as you were designing your warehouse. And it always involved trade-offs. Denormalization (one big wide table with lots of repeated values) takes up more space. But by eliminating joins, it's faster for some queries. Normalization, on the other hand, avoids repeated values and yields more compact tables. So it's more space-efficient and makes other types of queries faster.

But some of the newest warehousing technologies, like Google BigQuery, offer a third option: don't decide, nest.

To explain what this means and why it's so powerful, let's imagine a (really) simple dataset describing orders and the items that make up each:

Order 1, 2017-01-25: {2 Apples @ $0.50 each, 3 Bananas @ $0.25 each}
Order 2, 2017-01-26: {4 Carrots @ $0.20 each, 1 Date @ $0.60 each, 1 Endive @ $2 each}
Order 3, 2017-01-27: {2 Fennel @ $1 each, 2 Bananas @ $0.25 each}

One way to structure this data would be in a snowflake schema:

Image: Snowflake schema of the orders data

This makes counting orders (SELECT COUNT(*) FROM orders) really cheap, since it doesn't need to worry about any of the dimensions of those orders. But getting revenue by day requires two joins (and joins are usually expensive):

SELECT
  order_date,
  SUM(unit_price * quantity)
FROM
  orders
  JOIN order_items USING(order_id)
  JOIN items USING(product_id)
GROUP BY 1
ORDER BY 1

Alternatively, you could completely denormalize the data into one big, wide table:

Image: Denormalized orders table (orders_denorm)

That makes the revenue query dead simple and very cheap:

SELECT
  order_date, 
  SUM(quantity * unit_price) 
FROM 
  orders_denorm 
GROUP BY 1
ORDER BY 1

But now to get a count of orders, you have to do:

SELECT COUNT(DISTINCT order_id) FROM orders_denorm

That scans 7 rows instead of 3; plus DISTINCT is expensive. (And yes, when your table is 7 rows it’s not a big deal, but what if it’s 7 billion?)

So neither is ideal. Really, what you'd like is not to choose your schema until the last moment, when you know what kind of query you're running and can choose the optimal schema. And that's what nesting does.

So here's what the same data looks like nested in BigQuery:

Image: The orders data nested in BigQuery

Basically, you've nested a whole table into each row of the orders table. Another way to think about it is that your table contains a pre-constructed one-to-many join. The join is there if you need it, at no additional cost, but when you don't need it, you don't have to use it.
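
To make that concrete, here's roughly what the nested table's schema could look like, expressed as a BigQuery DDL sketch (the dataset and column names are illustrative):

-- Sketch only: the dataset and column names are illustrative.
-- Each order row carries an array of item structs (the pre-constructed join).
CREATE TABLE demo.orders (
  order_id    INT64,
  order_date  DATE,
  order_items ARRAY<STRUCT<
    item_name  STRING,
    quantity   INT64,
    unit_price NUMERIC
  >>
);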

Now, when you just want to know how many orders there were, you don't have to scan repeated data or use DISTINCT. You just do:

SELECT COUNT(*) FROM orders

But when you do want to know more about the items in the orders, you simply UNNEST the order_items fields. This is far less computationally expensive than a join, because the data for each row is already colocated. That means the unnesting can be performed in parallel (whereas joining, even when joining on a distribution key, must be performed serially).

The SQL for unnesting can be a bit strange at first, but you quickly get the hang of it. And luckily, Looker can write it for you anyway. Here's how you'd find the revenue for each of the three days:

SELECT 
  order_date,
  SUM(order_items.quantity * order_items.unit_price)
FROM 
  orders
  LEFT JOIN UNNEST(orders.order_items) as order_items
GROUP BY 1
ORDER BY 1

(You'll note that even though there appears to be a LEFT JOIN in the query, since you're just joining a nested column that's colocated with the row it's joining, computationally this join is almost costless.)

To sum up, when you're dealing with data that is naturally nested, leaving it nested until query time gives you the best of both worlds: great performance no matter what level of the hierarchy you're interested in.
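
As a final illustration of that flexibility, here's a sketch of a single query against the nested table that counts orders from the outer rows and sums item revenue from the nested data in one pass:

-- Sketch: order counts come straight from the outer rows, while revenue
-- is computed from the nested items, all in a single query.
SELECT
  order_date,
  COUNT(*) AS order_count,
  SUM(order_revenue) AS revenue
FROM (
  SELECT
    order_date,
    (SELECT SUM(oi.quantity * oi.unit_price)
     FROM UNNEST(order_items) AS oi) AS order_revenue
  FROM orders
) AS per_order
GROUP BY 1
ORDER BY 1;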

Is our story of the 2016 election wrong? Let’s Ask the Data.


There were a lot of stories told about the 2016 Election. Working Class Whites? Hispanic voters? Unlikable candidates? But now the data’s in. Which holds up?

Earlier this month, the Census Bureau released interviews with 130,000 Americans conducted in the wake of November’s election. Now we have the data necessary to test each narrative and see which ones stand up.

Unfortunately, the Census’ tool for accessing the data is very early 2000s. (It’s a Java Applet.) So I downloaded the whole dataset, loaded it into Google BigQuery and threw Looker on top so we can quickly and easily slice and dice the dataset.

So, let’s get started.

A core story of the election was that the White Working Class (WWC) loves Donald Trump, and that they would turn out in huge numbers for him. Political pundits were sure that Trump would bring out voters who usually stayed home. Except, not really.

So then I figured maybe it was a regional thing. In the Midwest, surely the WWC turned out in droves, right? Wrong. In fact, the data shows voting rates actually dropped off from 2012 among White Working Class voters in Wisconsin, Michigan, Iowa, and Ohio. They did rise in Pennsylvania (a key battleground state), but only by one percentage point.

Alright, let’s check out the flip side. Another theory was that Hispanic voters were going to turn out in huge numbers to oppose Donald Trump. Buuuuut, that doesn’t hold up either. Hispanic turnout peaked in 2008 at 50% of eligible voters, but was back down to 48% by 2016.

Ok, so what actually happened? One thing that wasn’t much discussed, but clearly had a big impact, was the dropoff in voting rates among African Americans. This might be expected after two consecutive elections with Barack Obama on the ballot. But still, the dropoff was very significant. Participation of Black men in 2016 was even lower than it had been in 2004 — before Obama was on the ballot. Among Black women, voting dropped seven percentage points from 2012 to 2016.

When we zoom out a bit and look at the long-term trends in American elections, it turns out that 2016 wasn’t so much an aberration as another step in the same long pattern.

The data shows that White, non-Hispanic voters with no college education are making up a smaller and smaller share of the electorate. Hispanic voters are making up a larger and larger share of the electorate, but it’s because the population is growing, not because voting rates are changing. Black voters are relatively stable, with increases in 2008 and 2012, and a dropoff in 2016.

Basically, the story is that there wasn’t a huge new thing that made this election “different.” In short, we live in an evenly split country that’s changing slowly. So even small changes in the makeup of the electorate can swing an election.

Maybe that isn’t the story we were telling ourselves last October, but, well, data has a great way of knocking down alternative facts.

Driving Growth with Data: Mary Meeker's 2017 Internet Trends Report


If there’s one place to get a better understanding of the latest in tech, it’s the Annual Internet Trends Report from Kleiner Perkins’ Mary Meeker. As I read through it this year, there was a crystal clear theme driving all of these innovative businesses forward: data. No matter the industry, the ability to harness the power of data is the critical component that sets the disruptors and innovators apart from the followers.

Last year, Looker had the honor of being included in this report as Mary introduced the third wave of Business Intelligence and showed the rise of the data platform. No longer do businesses need to rely on a complex and rigid stack of tools from the 1st wave, nor do they need to rely on connecting together a box of parts from the 2nd wave. The 3rd wave delivers a data platform that can serve the needs of every department and function. Add in the flexibility to push data into an employee’s workflow in the very moments when they need to make a decision, and you have a serious competitive advantage.

One year later, we’re seeing the results from companies that have embraced this third wave and built this data platform. A large number of Looker customers are now featured as the big innovators in their categories. These companies are leveraging data to disrupt the industries they operate in and provide better experiences for their customers - and it’s starting to pay off in a big way.


Crushing Candy with Data

Gaming is a colossus of an industry, valued at over $90 billion and producing a mountain of user-generated data. Mobile game developer King reports that Candy Crush’s daily active users spend an average of 35 minutes a day in its apps, resulting in a slew of interaction and event data.

Image: Video Games Most Engaging Social Media

Through this data, gaming companies are uncovering customer purchasing behavior and using level balancing to keep users engaged with their product.

Image: Video Games Self Optimizing


See Gaming Data in a Looker Dashboard


Data is the New Black

Glossier is a beauty supply startup that operates exclusively online. Through social media influencers and user-generated content, Glossier is able to reach a widespread audience with its content marketing, which has translated into customer growth of over 500% across 2015 and 2016.


Getting Fit with Data

Peloton provides home workout programs to its customers through a monthly subscription service. Through their user data, they’re able to track customer retention, workout logs, and customer cohorts to spot trends in product usage.


Data Can Calm the Mind

Headspace is an app that provides guided meditation to its customers. Customers of Headspace have their own activity dashboard within the app that displays usage information such as Total Number of Sessions and Average Session Duration. These metrics not only give Headspace valuable customer usage information but can also help re-engage customers who’ve stopped using the product.


Regardless of industry, these innovative companies are using data to truly change their industries and they are starting to get noticed. Make sure to check out the full Internet Trends report from Kleiner Perkins to learn more and start thinking about how you are building data into the core strategy of your business.

Tenjin Looker Block: Run Smarter App Marketing Campaigns


In today’s mobile ecosystem, app marketers have to be savvy about how they allocate their ad spend. The landscape is simply so competitive that in order to run profitable install campaigns, developers have to use a data-driven approach that lets them target the right users, partner with the right ad sources, and track their data all the way through the customer lifecycle.

Built by product and engineering experts with several years of app marketing experience, Tenjin is designed to be the most flexible, comprehensive and intuitive marketing platform in the mobile industry for the biggest apps in the world, such as Yelp and Dots, as well as for indie developers just getting started.

And after partnering with Looker earlier this year, Tenjin is now pleased to announce that the Tenjin Looker Block is available for Looker customers who want to grow their mobile apps quickly, scalably and profitably.

Tenjin’s data tracks the entire app marketing lifecycle, starting with the source of the install and continuing on to the user’s in-app behaviors so that it can monitor their full lifetime value -- including both in-app purchase (IAP) revenue and advertising-based revenue. That way, developers can map the revenue generated by each user back to the source of their install in order to understand the ROI for each campaign, allowing them to re-invest in profitable campaigns and reduce or cut out underperforming campaigns altogether.

The Tenjin Looker Block works with data from more than 100 advertising sources and aggregates them together so that app marketers can access all of their data in a single, flexible and easy-to-use dashboard.

Marketers can use the dashboard to analyze install campaigns based on any number of factors, including Cost Per Install, Cost Per Click, Click Through Rate, Total Spend, Total Installs and more. Below are some sample reports that show metrics for a specific campaign:

Image: Sample campaign reports

The Tenjin Looker Block allows marketers to:

  • Analyze campaign performance by dimensions such as ad network, country, platform, date, budget and more
  • Get user-level data on lifetime value, IAP revenue, ad-based revenue, return on ad spend, and more
  • Analyze daily behaviors of app users according to any number of customizable in-app events
  • Track cohorts of app users and compare them to other cohorts to see how changes affect retention and engagement performance
  • And more!


If you’d like to try the Tenjin Block, you can sign up with Tenjin for free here, or don’t hesitate to contact sales@tenjin.io with any questions. Learn more about the Tenjin Looker Block.

JOIN: Stay Curious, Keep Asking


When we were kids, we were full of questions. And we weren’t afraid to ask “why?” about just about everything. As a mom of four, I know this “why” stage can be irritating at times, but my kids are asking because they want to learn. And as they learn more, they will make better decisions and have more interesting insights about the world around them. Believe it or not, within a business, the process is no different.


Last year in New York we hosted our first user conference, JOIN 2016, bringing some of the most innovative minds in data together to collaborate, share ideas, and connect over what data can make possible for business and even for the world. We were amazed by the variety of ways customers were using data to better understand their businesses. Buzzfeed’s data science team spoke on optimizing content publishing while Bonobos, Jet.com, and Casper discussed accelerating their ecommerce businesses. All of these talks had one thing in common: here were data people who were not afraid to keep asking ‘why.’


As we’ve been talking to data people over the last year, we are continuing to hear this same theme. Companies like Gilt and DigitalOcean are setting themselves apart from their competition because of their drive to go beyond the dashboard and keep asking why.

JOIN 2017 - this year in San Francisco - will bring together all of these curious-minded people to share ideas, make friends and, together, continue to ask why. Already data experts from GitHub, IndieGoGo, and Blue Apron are scheduled to share their ideas on how to drive that spirit of curiosity in a business as well as their stories about how they are getting answers from their data.

So this year we hope you’ll come to JOIN 2017, to jump into the conversation, share your data stories and inspire everyone around you to keep asking.



CSV to Oprah in 5 (codeless) clicks



Flat-file formats like CSVs are a simple, universal way to store data. But as data gets bigger and bigger, getting a handle on what’s in them can get tricky. In fact, even loading them may be too much for your computer to handle.

Developers might use command-line tools like sed and awk to work with these files, but let’s be honest, UNIX commands aren’t going to cut it for normal folks.

That’s why we’re adding Instant Insight to Looker. While examining millions of rows is probably too much for your laptop to handle, today’s massive databases aren’t fazed even by rows in the billions. Looker’s Instant Insight lets you leverage all the power of those databases, no coding required.

To try it out, we grabbed all 3 million rows of President Obama’s White House Visitor Logs from the archived whitehouse.gov site (each year’s data is a few hundred MB--way more than Excel can comfortably handle). We uploaded the files into Google Cloud Storage, then loaded them into BigQuery (letting BigQuery detect the schema automatically).

Because Looker is entirely web-based, we even built a Chrome extension that takes us to Instant Insight directly from BigQuery. So now, in just 5 clicks we can explore that CSV in Looker and actually get some insights out of it.

So, about those 5 clicks...

Now that we have this data available to explore in Looker, we can start to poke around. So, first, let’s see what we’re looking at. Counting things isn’t Excel’s strong suit, but Looker’s interface makes it simple.

Right off the bat, we know that this data is not complete. We can see here that 2009, 2013 and 2014 had significantly fewer visitors than 2010 and 2011. And the data from 2012 is missing entirely. Good to know!

Now that we know that, let’s have some fun. So, first things first, who was visiting POTUS and FLOTUS at the White House? (In case you don’t know, those acronyms stand for President of the United States and First Lady of the United States, and yes, they really use them in the White House visitor logs.)
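
(For the SQL-curious, the exploration Looker generates under the hood looks something like the sketch below; the table and column names are assumptions based on the raw visitor-log CSV, and the SQL Looker actually writes will differ.)

-- Sketch only: table and column names are assumptions based on the raw
-- visitor-log CSV. Most frequent visitors to POTUS and FLOTUS.
SELECT
  namelast,
  namefirst,
  COUNT(*) AS visit_count
FROM whitehouse.visitor_logs
WHERE visitee_namelast IN ('POTUS', 'FLOTUS')
GROUP BY 1, 2
ORDER BY visit_count DESC
LIMIT 20;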

Okay, that list looks pretty standard. But since we’ve got row-level detail available, let’s see if we can find any interesting individuals. How about Oprah?

Looks like Oprah got some major facetime with POTUS and FLOTUS. Who else might have been visiting POTUS?

Jon Stewart seems a likely candidate. But since that’s not his real name….

Image: Jon Stewart’s real name, Jonathan Leibowitz

With this knowledge, we can search for Mr. Leibowitz….

So from this data, we see two things...

  1. Jonathan (no middle name) Leibowitz visited the White House a lot.
  2. Names are not unique (duh). Because of this, we would need to get more data to really see if this was Jon Stewart.

Because this data is in Looker, sharing our insights is as easy as copying and pasting the URL. So for your friends who are more into political celebrities (Henry Kissinger, anybody?) than celebrity celebrities, just send them the link and they can join in the fun.

At Looker, we believe deeply in the power of analysis. But that doesn’t just mean focusing on solving the big, sexy problems (though, we are focused on those). It also means removing the obstacles that slow analysts and business users down every day. We think Instant Insight does a lot in that regard, and we’re excited for you to check it out.

Introducing MemSQL Cloud: The Real-Time Data Warehouse for Any Cloud


Looker is excited to be a launch partner for MemSQL Cloud. We’ve worked closely with MemSQL to support its real-time data warehouse as a data source as part of our strategic technology partnership, and that support can now be extended to its new cloud offering.

Why MemSQL Cloud?

Looker customers using MemSQL have already realized advantages across deployment capabilities, scaling, real-time data ingest and querying, and more. Some things that we’d love to highlight about MemSQL Cloud are:

  • Flexible deployment options: Like Looker, MemSQL Cloud offers multiple deployment options that can be tailored to your unique use case. These include:
    • A Fully Managed Service offers a flexible and low-maintenance way to use MemSQL. Hardware, software, and management services are available with an hourly or annual subscription.
    • Customer hosted and managed deployment on the public cloud, with support for popular and trusted options such as AWS and Microsoft Azure, and soon Google Cloud Platform.
    • Private cloud deployment offers the ability for customers to host in their own VPC, leveraging their internal investments.
  • Real-time performance: Streaming ingest and low latency make MemSQL Cloud ideal for real-time reporting and analytics use cases. Support for high concurrency makes it optimal for powering dashboards with high query volumes.
  • Security: MemSQL Cloud is able to maintain performance while adhering to comprehensive security requirements. Data encryption is supported both at time of ingest and when delivered across nodes.

You can find even more details about MemSQL Cloud on its page.

Why Looker for MemSQL Cloud?

Looker’s core architecture directly leverages the advantages of MemSQL Cloud. The Looker platform connects to MemSQL over a standard JDBC connection and uses the LookML modeling layer to write SQL directly against MemSQL.

The results are then visualized in the Looker application directly or by leveraging our platform capabilities, including scheduled delivery, iframe embedding into an application or use of the Looker API.


In short, the Looker MemSQL stack makes it possible to:

  • Keep data in MemSQL: Looker doesn’t ingest or move any data from MemSQL, which means you can just leverage its enterprise-grade security.
  • Provide governed data access to everyone: Looker offers the ability for anyone to query MemSQL in a permissioned and governed way, fully taking advantage of your investment.
  • Results in real time: Since Looker queries MemSQL directly, results are as real time as the data in MemSQL. You can set up Looker dashboards to auto-refresh at the interval you’d like - down to seconds - and provide up-to-date information to your data consumers.

Interested in learning more?

Try MemSQL Cloud by requesting a demo here! And as always, request a Looker demo here. Our teams are both knowledgeable on the joint solution and can get you up and running quickly.



Marketing Conversion Rates: Clarity, Trust, and the Feedback Loop


Test, measure, iterate. The dialogue is similar with most SDR departments across tech: book more meetings, make more calls, adopt new approaches. There are so many different types of strategies, and while each approach has its value, I think we can all agree that these approaches are meaningless if you can’t track their effectiveness.

At Looker, our SDR team takes a multifaceted, data-driven approach to everything we do. Below, I’m sharing the most valuable metrics and ‘Looks’ (aka dashboards, if you aren’t a Looker user) that our team uses to achieve maximum efficiency.

Clarity

Our SDR team understands the value of Looker to our process. We rely on the conversion data in our Looker dashboards to inform our team’s daily workflow. From monitoring the pipeline to something as simple as tracking phone calls made, we’re always exploring data to learn how to maximize our efficiency.

When we need additional clarity regarding a specific program’s conversion, we can check the conversion metrics against what we know to be a solid benchmark number, and then update our follow-up process accordingly. For instance, if a lead comes from a new lead generation program and they are not explicitly interested in BI or analytics, we’ll know to put them on a longer educational email cadence to help us better understand their specific wants and needs.

If a lead comes to us from organic traffic and has viewed a recorded demo or other Looker-specific content, we treat these actions as high-value. We prioritize these leads and follow up with highly targeted emails and phone calls. We gauge all successes and failures within Looker and will often find that certain methods of outreach - calls, drip emails, social touches, etc. - yield the best results. In order to maximize our outreach, we strive to match the BEST messaging with the BEST leads.

Trust

Trust goes a long way. Whether it’s trusting your colleagues, trusting different departments, or trusting your company as a whole, there need to be mechanisms in place to achieve trust. There are so many mechanical and organic ways to build trust and solidarity within a company. At Looker, we use data to build trust. Data is the great equalizer, and all of our data is visible for the entire company to see and explore. If an SDR is underperforming or crushing their goals, the entire company can see that. The same goes for sales and marketing.

For marketing, SDRs, and sales to have a point of reference and an open line of communication, it’s imperative that we understand the conversion of these leads to meetings, program funnel metrics, and the life of the lead. If a program is performing poorly, we can then track the effectiveness of our phone and email follow-up. Are our cadences bad? Are the personas poor? Not sure? This is when we ask the question in Looker.

In Looker, we are assured that every team is accessing the same data from a single point-of-truth. Therefore the metrics are all going to be determined the same way, every time. Since we all know we can trust the data and metrics we’re looking at, the data opens up clean lines of communication within and between teams.

The Feedback Loop

Remember the days of passing spreadsheets back and forth? You would have to save your iteration with something like “J.K.- edit” and after about 10 passes the data would lack any semblance of the initial draft.

Well, with Looker’s web-based architecture, deep-diving into a marketing campaign’s effectiveness is as easy as sharing a URL. With Looker, it’s that simple.

I can run ad hoc analysis on the data, chat it to a peer, let them dive in, and then we can discuss and take action based on our findings. It’s so dang simple, and it’s incredibly easy to track and iterate on changes. Looker has become one of the best collaboration tools I’ve ever used.

This is the story for our finance, sales, marketing, customer success, and support teams. The feedback doesn’t have to stay siloed. Much like any mature business, we share data cross-departmentally - dashboards and Looks shared in an email, scheduled out, sent as an alert, dropped into a chat, or embedded in our intranet. We have a clear understanding of our data, and we are all constantly collaborating over it to make our business successful.

So far, I would say we’re doing pretty darn good.


Check back in a couple weeks and I’ll dive deeper into the 5 Must-Have Dashboard elements to increase SDR/BDR efficiency!

Moving Zendesk Data into BigQuery with help from Apache NiFi's Wait/Notify Processors


zendesk

With the release of NiFi 1.2/1.3, a number of new processors were introduced, including the Wait/Notify and GCSObject processors. Using the Wait processor along with the Notify processor, you can hold up the processing of a particular flow until a "release signal" is stored in the Map Cache Server. These processors are very useful in ETL flows that call a web service to get data. Most web services return data in a paginated fashion: you grab a "page" of data at a time and process it, then grab the next page and process it, and so on. Although you can sometimes use the MergeContent processor for this, it’s really hard to set up the configuration properly so that you delay the merging long enough before continuing on with the flow.

The initial flow that we planned on applying these processors to was our Zendesk Chat service. Although the current flow has been working fine, it depends on a third-party database sync process from Zendesk to our Redshift, which only runs once an hour. Being able to process the chats directly from the Zendesk API gives us chat metrics sooner, cuts out the middleman, and removes an integration point. Now we're able to load directly into Google BigQuery on a much quicker 10-minute cadence.

The high level overview of the flow is as follows:

  • Trigger the event flow for a given date range.
  • Extract all the chat "sessions" for the given range (this may require a looping construct).
  • Query and get all the messages for the given date range.
  • Load into Google Cloud Storage.
  • Load into BigQuery.

I won't cover the details of the first step, triggering the flow, but essentially we use a combination of HandleHttpRequest and GenerateFlowFile processors to do this. The HandleHttpRequest allows us to process a given range of dates in an ad hoc manner via a simple HTTP POST, whereas the GenerateFlowFile allows for scheduling the flow via a cron expression or a run-every-X-seconds/minutes/hours schedule.

Once we have the date range, the next part is to get all the "chats" from the API. We don't know ahead of time how many chats/sessions we'll have, so we'll need to handle pagination with some sort of looping construct. The Zendesk search API returns a next_url if there are more chats for the given search. In that case, we loop around, get the next batch of sessions, and pick up the new next_url. This continues until next_url is null. Below is the flow we're using to get all the chats for a given date range.

zendesk
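
Outside of NiFi, the pagination logic this flow implements boils down to a loop like the sketch below. The endpoint, query parameters, and auth are placeholder assumptions for illustration; in the real flow the HTTP calls, throttling, and looping are handled by the processors pictured above.

```python
import time

import requests


def fetch_chat_sessions(start_date, end_date, token):
    """Follow next_url page by page until it comes back null,
    collecting every chat session in the date range."""
    url = "https://example.zendesk-chat.test/api/v2/chats/search"  # placeholder endpoint
    params = {"q": f"timestamp:[{start_date} TO {end_date}]"}      # placeholder query
    headers = {"Authorization": f"Bearer {token}"}
    sessions = []

    while url:
        resp = requests.get(url, params=params, headers=headers, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        sessions.extend(payload.get("results", []))
        url = payload.get("next_url")  # null on the last page, which ends the loop
        params = None                  # next_url already carries the query string
        time.sleep(1)                  # crude stand-in for the flow's rate-limit delay

    return sessions
```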

Since we need to be careful not to exceed the rate limit for the API, a delay is implemented that throttles each call to the search API. With each iteration, the HTTP response contains a results JSON array. The results array includes the url of the chat, a starting timestamp, a preview message, a type attribute, and the id of the chat session. It's rather easy to extract all the ids from the results array using JQ.

zendesk
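
The JQ step itself is tiny. Invoked the way ExecuteStreamCommand invokes it, it amounts to something like the snippet below (the exact filter in our flow may differ slightly; the field names follow the results array described above).

```python
import json
import subprocess

# One page of the search response, trimmed down to what matters here.
response_json = '{"results": [{"id": "chat-1"}, {"id": "chat-2"}], "next_url": null}'

# Pipe the response through jq and keep only the session ids,
# the same transformation the ExecuteStreamCommand step performs.
proc = subprocess.run(
    ["jq", "-c", "[.results[].id]"],
    input=response_json.encode(),
    capture_output=True,
    check=True,
)
session_ids = json.loads(proc.stdout)  # ["chat-1", "chat-2"]
```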

Next, all the "pages of chat sessions" are sent downstream to a MergeContent processor, which merges them into a single FlowFile. Once all the session ids are merged, we immediately split the entire group via the SplitText processor into chunks of 50 (the Zendesk API limit). The SplitText processor provides some useful attributes that we'll use in the Wait processor. The key one in this flow is fragment.count. Each FlowFile resulting from the split will have a fragment.index attribute, which indicates the ordering of that file in the split, and a fragment.count, which is the number of splits from the parent. For example, if we have 120 chat sessions to process and we split those into 50 sessions per chunk, we will have three chunks. Each one of these FlowFiles will have a fragment.count equal to three, along with a fragment.index that corresponds to its position in the split. The original FlowFile is sent off to the Wait processor, which we'll discuss later. At this point, we've set the stage to use the bulk chat API to efficiently extract the chat messages.

zendesk

All the split chunks are sent downstream for processing. Each "chunk of 50" is sent through the flow, calling the Zendesk API to get all the chat messages for a given "chunk" of ids. A bit of JSON transformation is done in the ExecuteStreamCommand processor by calling JQ again. The output stream is then compressed and written to Google Cloud Storage via the new PutGCSObject processor.

zendesk
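
For reference, the compress-and-write step that PutGCSObject performs for each chunk looks roughly like this outside of NiFi (the bucket and object names are made-up placeholders).

```python
import gzip

from google.cloud import storage


def upload_chunk(ndjson_text: str, object_name: str) -> None:
    """Gzip one chunk of chat messages and write it to Cloud Storage,
    mirroring what PutGCSObject does with each FlowFile."""
    client = storage.Client()
    bucket = client.bucket("example-zendesk-chats")  # placeholder bucket
    blob = bucket.blob(object_name)
    blob.content_encoding = "gzip"
    blob.upload_from_string(
        gzip.compress(ndjson_text.encode("utf-8")),
        content_type="application/json",
    )
```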

The success relationship of the PutGCSObject is routed to the Notify processor, which in turn updates the key/value in the DistributedMapCacheServer. The key being updated in the cache is the Release Signal Identifier, which in this case is the ${filename}, and the counter being updated is named by the Signal Counter Name. Since the same flow is used for both the sessions and the chat processing, we use an attribute for the Signal Counter Name and expression language to set that value at runtime. The delta applied to the update is the Signal Counter Delta. So for each file written successfully, the Signal Counter Name is incremented by 1.
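
Conceptually, the Notify and Wait processors are two sides of a shared counter keyed by the release signal identifier. Here's a toy simulation of that bookkeeping (plain Python, not NiFi configuration), using the 120-sessions-in-three-chunks example from earlier; the filename and counter name are placeholders, and the Wait side of the check is what the next paragraph describes.

```python
from collections import defaultdict

# The DistributedMapCacheServer, reduced to nested counters:
# release_signal_id -> {signal_counter_name: count}
cache = defaultdict(lambda: defaultdict(int))


def notify(release_signal_id: str, counter_name: str, delta: int = 1) -> None:
    """What Notify does after each successful write to Cloud Storage."""
    cache[release_signal_id][counter_name] += delta


def can_release(release_signal_id: str, counter_name: str, target: int) -> bool:
    """What Wait keeps checking: has the counter reached the
    fragment.count carried by the original FlowFile?"""
    return cache[release_signal_id][counter_name] >= target


fragment_count = 3  # 120 sessions split into chunks of 50
for _ in range(fragment_count):
    notify("chats-2017-06-01", "chats")  # placeholder ${filename} and counter name

assert can_release("chats-2017-06-01", "chats", target=fragment_count)
```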

Now we need to back up a bit in the flow to the SplitText processor labelled "chunk chats", where we see a fork in the processing. As I mentioned earlier, the original FlowFile is routed to the Wait processor. The FlowFile will sit and "wait" for the appropriate "release signal" to be written to the DistributedMapCacheServer before proceeding. Once the "release signal" is detected, indicating all the chunks/chats have been processed and written to Google Cloud Storage, the FlowFile is routed to the success queue. From here the flow continues, and downstream processors load the data into a Google BigQuery daily partitioned table.

zendesk

zendesk
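
The final step is an ordinary GCS-to-BigQuery load job into the day's partition. A rough sketch with the google-cloud-bigquery client is below; the dataset, table, and bucket names are placeholders, and the exact partitioning setup in our flow may differ.

```python
from google.cloud import bigquery


def load_day(date_str: str) -> None:
    """Load one day's gzipped NDJSON files from Cloud Storage into the
    matching daily partition, e.g. zendesk.chats$20170601 (names are placeholders)."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    uri = f"gs://example-zendesk-chats/chats/{date_str}/*.json.gz"
    partition = date_str.replace("-", "")
    load_job = client.load_table_from_uri(
        uri, f"zendesk.chats${partition}", job_config=job_config
    )
    load_job.result()  # block until the load job finishes
```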

In the event that one or more of the files are not successfully written to Google Cloud Storage, the original FlowFile in the wait queue will eventually expire (10 minutes by default). This results in the FlowFile being routed to the expired queue. The expired queue triggers a message to an Amazon SNS Topic that notifies me with the FlowFile identifier. I’m also notified in the same way if any of the copies to Google Cloud Storage fail. Using the identifier, I can quickly and easily look up the Provenance Event to see what the problem was.

zendesk
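
The notification itself is a single publish call with the AWS SDK, along the lines of the snippet below (the topic ARN is a placeholder).

```python
import boto3

sns = boto3.client("sns")


def alert_expired_flowfile(flowfile_uuid: str) -> None:
    """Publish the FlowFile identifier so it can be looked up later
    in NiFi's provenance events."""
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:nifi-alerts",  # placeholder ARN
        Subject="Zendesk chat flow: FlowFile expired before release",
        Message=f"FlowFile {flowfile_uuid} expired in the Wait queue; check provenance.",
    )
```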

The addition of the Wait/Notify processors, along with the GCSObject processors, further extends the usefulness of NiFi. The flexibility of being able to adjust and change workflows on the fly in NiFi is amazing. We’re constantly finding additional ways to leverage NiFi to efficiently consolidate more data sources into Google BigQuery.

Data of Thrones Part I: Screen Time, Episodes, and Death in Game of Thrones


data_of_thrones

Looker is home to many Game of Thrones nerds, and recently we got our hands on some datasets describing the TV show and books. This is Part I of a four-part series where we dive into these datasets and share what we learned from the numbers on our favorite show - Game of Thrones.

There are a fair few resources on George R. R. Martin’s books (more on that later), but datasets on the TV show were much harder to find. In the end, we landed on two: a dataset on screen time from data.world and a dataset on deaths in Game of Thrones from Kaggle. With some SQL magic from our analysts, we were able to join the two datasets together and add some extra background from the books fact table. While there were a few limitations with this method, the data mapped surprisingly well between the book series and the TV series.
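
Our analysts did the join in SQL, but the idea is simple enough to sketch in pandas. The file and column names below are hypothetical stand-ins for the two datasets.

```python
import pandas as pd

# Hypothetical exports of the two datasets described above.
screen_time = pd.read_csv("screen_time.csv")    # character, episodes, screen_time_minutes
deaths = pd.read_csv("character_deaths.csv")    # character, death_season, death_episode

# Join on character name; characters absent from the deaths dataset are still alive.
got = screen_time.merge(deaths, on="character", how="left")
got["is_dead"] = got["death_season"].notna()
```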

25 Things We Learned by Looking at Data on Game of Thrones

While simple, episode count and screen time tell us a lot about how the showrunners are thinking about the characters we know and love (or hate).

A character’s time on screen and episode inclusion is no accident. The writers of the show are putting characters in episodes for specific reasons, and the amount of time they actually spend on screen is also a conscious choice.

These two pieces of information, combined with the death data, allow us to dig under the surface of the show’s storyline and look for trends and patterns in the show itself, while also examining our own perception of the show and how it differs from what the data shows.


CAUTION : SPOILER ALERT


First, let’s look at Screentime and episode count:

  1. Tyrion is definitely the star of the show - he has both the most episodes AND the most screen time. Now, this isn’t that surprising, until you look at how much of a lead Tyrion has in screen time. He has more than 25 minutes more screen time than the next character and more than 70 minutes more than the third!

  2. The next two screen time leaders are Jon and Daenerys, and they are also tied for episode count. Again, these top three are no surprise (is it a sign?!), but the 47-minute difference between Jon’s and Daenerys’ screen time was really interesting. While Jon has for the most part led his own storyline, Daenerys’ storyline has been entirely her own. This data shows that Jon’s story has been given more screen time than Daenerys’.

  3. Cersei, Sansa, and Arya come next - with Cersei only leading Sansa by about a minute. I have always thought of Cersei as being a much larger character than Sansa, but according to screen time, that is not the case. I found this interesting because Cersei feels more important than Sansa throughout the show, with Sansa only really taking control over her story in Season 6. This is a clear juxtaposition of screen time vs. power in the story. Cersei has definitely done more to change the story than Sansa, but they both have had the camera pointed at them for virtually the same amount of time. What does that mean for these two characters from the perspective of the show creators? I don’t know that I have an answer, but I definitely have more questions.

  4. Sansa has more screen time than Arya. Again, I underestimated Ms. Sansa Stark until now. It’s similar to the Jon vs. Dany realization - while Arya has been leading her own story and therefore feeling more powerful and important, the show creators haven’t given it as much time as Sansa’s story.

  5. Ned Stark somehow comes in at #12 on this list, even though he died in Season 1. His total screen time has gotten some help from his return via flashback in Season 6, but even so, he has had more screen time than most of the characters we see today. No wonder his loss in Season 1 was so jarring - he was in practically the whole thing.

  6. Some unintuitive character pairs & groups show up on this list. For example, Theon and Sam are very nearly tied in screen time, and Brienne and Davos have literally the same amount of time on screen. Catelyn and Varys are about the same, and Tywin Lannister, Margaery, and Robb are all close as well. Ramsay, Melisandre, and Bronn are grouped together, and so are Gilly and Ygritte. This doesn’t provide a ton of insight, but it does show us who the show creators think are at similar levels of importance.

Now, if we keep screen time, but sort by episode some new insights show up:

  1. While she may not be on top in screen time, Cersei is second in overall episode count (following her youngest brother, which I’m sure she would just love). This jump is pretty interesting when you consider Cersei’s methods of power in the early seasons: she is mostly working behind the scenes (or not really working at all).

  2. Eddison Tollett is a surprisingly high ranker on the episode count list. He even has more episodes than Tywin and Podrick! This was especially surprising to me, because I didn’t even know who he was when I first saw his name on the list. I have since looked him up and I’m still surprised.

  3. Missandei has been in a lot of episodes but hasn’t had very much screen time. While her role as Dany’s right hand lady explains her lack of time on screen, her inclusion in so many episodes leads to questions about her role in the show - are they leading up to something?

  4. Varys’ and Pycelle’s high ranking on this list is also telling. We know them both to be “behind the scenes manipulators” so their lack of screen time makes a lot of sense. They are always there but purposefully out of the spotlight...

  5. On the other end of the spectrum, this visualization highlights characters with high screen time but low episodes - Oberyn, Robert Baratheon and Ned Stark to name a few. I’ll talk more about this later on in this post, but it’s clear that “flash in the pan” characters are common in Game of Thrones...

Now let’s move on to the fun stuff - deaths!

Game of Thrones is known for killing off characters, so we examined a dataset that covers all named character deaths and when they happened. Let’s go through a few ways to look at this, starting with Deaths by Season.

data_of_thrones

  1. Season 3 is the least deadly of the show so far, which is interesting because it had, arguably, the goriest murder scene of main characters in the show (Hi, Red Wedding)

  2. Season 6 was the deadliest season yet… by a lot

But we can also slice this data in other ways - ever wondered which episodes are safest to DVR for Monday and not be ruined by Facebook spoilers? Let’s look at deaths by episode number...

data_of_thrones

  1. When visualized, the answer to the question is pretty clear: waiting a day on Episode 6 is your safest bet, followed by Episodes 3 and 8… No promises though, because there has not been a consistently safe episode in the series as a whole

  2. Overall, Episode 10 is where we’ve lost the most characters. So don’t miss the finale, or if you have to, turn your eyes away from the internet until you have a chance to watch

  3. A surprising tip from this chart is the spike in deaths in Episode 5. No procrastinating on the midseason episode, my friends!

In looking at deaths, I realized that not every named character death had the same impact on the storyline or our experience as viewers, so I changed my query to look at total screen time of the characters killed off in each season (assuming more screen time = more important, impactful deaths). Check out the new view:

data_of_thrones

(Note: I removed Jon Snow from this viz. He completely skewed Season 5, and for the purposes of this post, I’m focusing on characters who died and stayed dead)
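
In pandas terms (continuing the hypothetical file and column names from the join sketch in the first section), the new query is just a filtered group-by:

```python
import pandas as pd

# Same hypothetical join as earlier in the post.
got = pd.read_csv("screen_time.csv").merge(
    pd.read_csv("character_deaths.csv"), on="character", how="left"
)

# Total screen time of the characters killed in each season,
# with Jon Snow excluded for the reason noted above.
died = got[got["death_season"].notna() & (got["character"] != "Jon Snow")]
screen_time_lost = (
    died.groupby("death_season")["screen_time_minutes"].sum().sort_index()
)
```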


  1. This visualization challenges the logical assumption that characters killed in later seasons would have more screen time simply because the show had been around longer. Instead, it shows a cycle of characters getting very well known very quickly and then getting killed off in the same season or a few seasons later, while the core character base stays intact. Remember those "flash in the pan" characters I mentioned earlier?

  2. The jump in Season 4 is also notable. While there were fewer deaths in the season, the characters killed had almost the same combined screen time as Season 6

  3. Season 1 killed the same number of characters as Season 2, but we lost double the screen time (entirely because of Ned Stark)

  4. Season 3 killed far fewer characters than Season 2, but those characters had much more screen time overall (Hi again, Red Wedding), which actually makes it more impactful, death-wise

But we can get even more granular here. Let’s break it down by episode:

data_of_thrones

  1. So far, Episode 10 is definitely the deadliest episode for characters we’ve been introduced to, but this is a new trend...

  2. Seasons 1-3 were all about killing big characters in Episode 9, but starting in Season 4 they started killing everyone in Episode 10

  3. In comparison to the rest of the show, Season 2 was very tame

  4. No one of note has died in Episode 6 in the last three seasons

  5. The death story arc changed halfway through the show - in the visualization above, look at how Seasons 1-3 compare and then 4-6. In the early seasons, they killed very few characters before Episode 5, saving them all for the end. As the show goes on, there is a bump across Episodes 1-5 that completely ends at Episode 6 and then grows to a peak at Episode 10

Insights from this data range from surprising to mildly interesting, and it all leads to even more questions.

In the next few weeks leading up to the Season 7 premiere on July 17th, we will be diving into different areas of this data on Game of Thrones and looking at the show in new dimensions.

Subscribe to the blog to stay in the know on Looker’s latest posts, both GoT related and otherwise, and if you are interested in seeing what Looker will allow you to uncover in your data you can request a demo here! Thanks for reading :-)


Disclaimer: Game of Thrones belongs to HBO and is not affiliated with Looker in any way.

Kustomer + Looker Put Support Data Through a New Lens


kustomer

The Customer Support function is playing an increasingly strategic and valuable role in companies. Today, we are excited to unveil a powerful new solution that uncovers valuable insights and trends about your customers. We have partnered with Looker to create a solution that enables companies to integrate their support team’s data into broader, company-wide insights and analysis.

First, a little bit about us. Kustomer is the first customer support platform designed and built around the customer. Kustomer brings together your customer data and conversations in one place to give you a comprehensive view of your customer.

Companies using Kustomer can easily create a real-time export stream of support activity into their data warehouse.

The Kustomer & Looker Block

Kustomer and Looker are a perfect match: Kustomer gives you one place to view all of your customer support data and Looker is a great way to help you share, combine, and analyze that information at an operational level for your whole company.

kustomer

To make it even easier to get started with your analysis, we’ve created a Looker Block for Support Analytics by Kustomer. Looker Blocks make it easy for companies to quickly deploy expertly built, tailored solutions specific to each business unit or data source. They are also a great way for partners like us to make the data we’re replicating into your data warehouse immediately actionable.

The Looker Block for Kustomer allows you to easily explore your customer, conversation, and team data to provide a comprehensive view of Customer Support team operations. Some example metrics include:

  • Key statistics for support team members, including average time to first completion
  • Average time to first response and average number of messages in conversations
  • Conversation status and volume by channel

kustomer

The Complete View of your Customer

So, why is using Looker on Kustomer data valuable? Kustomer gives you the context of the customer beyond an individual transaction, and in Looker you can link this data with other operational data in your data warehouse, including application data and home-grown systems.

This union of customer data enables you to perform the analysis that can result in actions that increase loyalty, improve customer lifetime value, and help customers get what they need. You can use Looker to uncover intelligent insights not only in your customer data but in your enterprise data too.

Want to learn more?

Check out the Support Analytics by Kustomer Block. Request a demo of Kustomer here or Looker here!

The JOIN 2017 Agenda is Here and it Rocks


Join Agenda

Ask any of my friends or colleagues and they will tell you that I’m easily excitable. With that said... this agenda is killer. The JOIN team has really pulled out all the stops when it comes to this content.

Last year we talked about disrupting with data; this year we want to encourage people to always keep asking why. And who better to encourage people to keep asking why than Simon Sinek - the man who encourages millions to Start with Why and create a world where everyone feels fulfilled and inspired by their work.

After the keynote, we’ll have three main tracks of content. Below, I’ve highlighted some of the sessions I’m most excited about...

Our Data Stories will shine a spotlight on customers who are changing their business with data and Looker.

  • Taking Business Intelligence from Historical to Real-time at Mic. Learn how Mic uses DynamoDB, Spark Streaming, Looker and other technologies to encourage real-time decision making.

  • How Coursera Brings Education for Tomorrow to Millions Today. Hear how Coursera developed and deployed an algorithm to help them provide the world’s best learning experience - everything from building affinity analysis to understanding the ROI on skills.

  • GitHub: How I got Dr. Dre Excited About Data. Successful data teams do far more than manage tools - they create a foundation strong enough to build and support a data culture. GitHub will talk about its experience changing culture at Beats Music, TuneIn, and GitHub, including how GitHub uses GitHub to manage Looker models.

Plus Speakers from: BuzzFeed, WeWork, Blue Apron, Twilio, Kiva and more!

Tech Talks will cover overarching issues & solutions to help you and your company be better users of not only Looker, but data in general.

  • Managing and Scaling with Large Datasets will talk about features both within and outside of Looker to help scale data access for large and complex datasets.

  • We’re opening up and talking about How we Looker at Looker. Looker leads from Support, Customer Success, and Demand Generation will walk you through the good, the bad and the creative pieces of our implementation.

Deep Dives will bring it all together with sessions on how to implement specific technologies and analyses.

  • Learn how to use Looker’s smart caching feature, which allows you to sync Looker with your ETL process to deliver faster and more timely dashboards.

  • Do you know about seamless workflows in Looker with Zapier and Data Actions? You can add everything from SMS Alerts and Slack notifications to options for pushing data into apps like Salesforce or Adwords.

  • All things table calculations. Learn how Table Calcs can give you greater control over how you display and think about your data. We’ll go over the basics, then dig into row-over-row calculations, pivot offsets, and more!

I hope this peek into the agenda gets you as excited about the content as I am. Our goal was to curate content that will spark new ideas for how to use Looker, train you on best practices, and teach you how to get more and more of your company drilling into the data and making more informed decisions.

Head to the website to check out the full agenda.
