One of the biggest selling points for having Google Analytics 360 is its integration with Google BigQuery, a tool that enables you to query and run analytics against your GA data at hit level.
But just having access to the tool and such granular data doesn’t always mean that you can derive information from it. As you know, garbage in — garbage out.
Whenever working with data, it is always good to have a reference point. For that purpose let’s look at the list of session ids, the count of which should match the count of sessions number in the GA interface.
Before writing a SQL statement, let’s look at the data schema to see what field names can give us the answer we are looking for:
fullVisitorId– the unique visitor ID (aka client ID). Count of this can give us unique users number, but not sessions number.
visitId– An identifier for this session. Count of this may give us total sessions number.
totals.visits– The number of sessions (for convenience). This value is 1 for sessions with interaction events. The value is null if there are no interaction events in the session.
Important observation #1. The number of sessions in BigQuery is always greater than in the Google Analytics interface. In our case the discrepancy is 2% due to the fact that GA automatically filters out sessions with no interaction events. To take into account that condition, it is necessary to have “where totals.visits = 1″ statement.
Let’s write a SQL statement that will use a combination of different fields and different counting methods. Let’s count unique visit_ids (the most obvious way to solve the question), let’s sum totals.visits, let’s use exact_count formula to count visit_id and some other stuff:
SUM(totals.visits) AS totals_visits,
COUNT(DISTINCT fullvisitorid) AS distinct_fullvisitorid,
COUNT(DISTINCT visitid) AS distinct_visitid,
COUNT(DISTINCT CONCAT(CAST(fullvisitorid AS string),CAST(visitid AS string))) AS distinct_visitid_fullvisitorid,
EXACT_COUNT_DISTINCT(CONCAT(CAST(fullvisitorid AS string),CAST(visitid AS string))) AS exact_distinct_visitid_fullvisitorid,
EXACT_COUNT_DISTINCT (visitid) AS exact_distinct_visitid,
FROM (TABLE_DATE_RANGE([myproject:mydataset.ga_sessions_], TIMESTAMP(‘2017–07–28’), TIMESTAMP(‘2017–07–28’)))
totals.visits = 1
Important observation #2. To get a count of unique values, I have been using the COUNT(DISTINCT metric_name) method. It turns out, this formula was only giving me approximate results. To get an accurate number we should be using the EXACT_COUNT_DISTINCT(metric_name) method.
Run the above query against your own data and see the difference between those methods yourself. Try it with and without “total.visits where” statement.
As you will prove using your own data there are 2 ways to get the correct count of total sessions number, that will match the value from GA interface:
a) SUM(totals.visits) which uses the “totals” metric created for you by Google.
b) EXACT_COUNT_DISTINCT(concat(cast(fullvisitorid as string),cast(visitid as string))) a method that makes more sense to me. This method also gives you a view on session_ids for in-depth analysis.
Important observation #3. Count of unique visitids doesn’t give a meaningful number. It turns out multiple users can have the same visitid. That’s why for unique representation of a session we need to use fullvisitorid+visitid string.