MemSQL¶
Infrastructure¶
2 VMs provisioned on Google Cloud Platform.
- VM#1
- Machine type - 8 vCPUs, 30 GB memory
- CPU platform - Intel Broadwell
- 50GB SSD persistent disk
- source image - debian-9-stretch-v20180307
- MemSQL - Master aggregator and a leaf node
- VM#2
- Machine type - 16 vCPUs, 30 GB memory
- CPU platform - Intel Broadwell
- 90GB SSD persistent disk
- source image - debian-9-stretch-v20180307
- MemSQL - Two leaf nodes
Install MemSQL¶
Download the software:
cd /tmp
wget http://download.memsql.com/memsql-ops-6.0.10/memsql-ops-6.0.10.tar.gz
tar zxvf memsql-ops-6.0.10.tar.gz
Run the installer:
cd memsql-ops-6.0.10
sudo ./install.sh
Make the second VM follow the first one. At this point MemSQL Ops Web UI should be able to see both VMs as part of cluster:
memsql-ops follow -h <master-vm-ip>
From the MemSQL Ops Web UI, add MemSQL nodes with one master and three leaf node roles across the two VMs.
For starting/stopping the cluster use the below commands from the master VM.:
memsql-ops memsql-stop --all
memsql-ops memsql-start --all
Schema Design¶
Create database:
create database mymemsqldb;
show databases;
use mymemsqldb;
Create yellow_trips table:
create table mymemsqldb.yellow_trips (
vendor_id VARCHAR(5),
pickup_datetime DATETIME,
dropoff_datetime DATETIME,
pickup_longitude DECIMAL(15,8),
pickup_latitude DECIMAL(15,8),
dropoff_longitude DECIMAL(15,8),
dropoff_latitude DECIMAL(15,8),
rate_code VARCHAR(5),
passenger_count TINYINT,
trip_distance FLOAT,
payment_type VARCHAR(5),
fare_amount FLOAT,
extra FLOAT,
mta_tax FLOAT,
imp_surcharge FLOAT,
tip_amount FLOAT,
tolls_amount FLOAT,
total_amount FLOAT,
store_and_fwd_flag VARCHAR(5),
key (pickup_datetime,dropoff_datetime,vendor_id) using clustered columnstore,
shard key (pickup_datetime, dropoff_datetime));
Copy yellow taxi dataset from google bucket to VM:
gsutil cp gs://yellow_taxi_trips/data/trips0000000000*.gz .
The above downloads all the ~1000 .gz compressed csv files on to local VM disk.
MemSQL can directly load .gz files in to tables:
load data infile '/home/suds/trips00000000*.gz' into table mymemsqldb.yellow_trips columns terminated by ',' optionally enclosed by '"' lines terminated by '\n' ignore 1 lines (vendor_id,@pickup_datetime,@dropoff_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,rate_code,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,imp_surcharge,tip_amount,tolls_amount,total_amount,store_and_fwd_flag) set pickup_datetime = STR_TO_DATE(SUBSTR(@pickup_datetime,1,19), '%Y-%m-%d %H:%i:%s'),dropoff_datetime = STR_TO_DATE(SUBSTR(@dropoff_datetime,1,19), '%Y-%m-%d %H:%i:%s');
Average Ingestion rate - ~200,000/s. Total Rows - 1,108,779,463 Total Size - 34 GB (62% compression)
Performance tests¶
Query#1¶
How many trips did Yellow taxis took each month?:
SELECT DATE_TRUNC('month',pickup_datetime) month, COUNT(*) trips
FROM mymemsqldb.yellow_trips
GROUP BY 1
ORDER BY 1;
#1 - 78 rows in set (6.66 sec)
#2 - 78 rows in set (5.90 sec)
#3 - 78 rows in set (5.08 sec)
Query#2¶
What was the average speed of Yellow taxi trips in 2015 by hour?:
SELECT EXTRACT(HOUR FROM pickup_datetime) hour,
ROUND(AVG(trip_distance / TIMESTAMPDIFF (SECOND,pickup_datetime,dropoff_datetime))*3600, 1) speed
FROM mymemsqldb.yellow_trips
WHERE trip_distance > 0
AND fare_amount/trip_distance BETWEEN 2 AND 10
AND dropoff_datetime > pickup_datetime
AND pickup_datetime >= '2015-01-01' AND pickup_datetime < '2016-01-01'
GROUP BY 1
ORDER BY 1;
#1 - 24 rows in set (4.12 sec)
#2 - 24 rows in set (3.22 sec)
#3 - 24 rows in set (1.51 sec)
Query#3¶
What was the average speed of Yellow taxi trips in 2014 by day of week?
SELECT DAYOFWEEK(pickup_datetime) DAY_OF_WEEK,
ROUND(AVG(trip_distance / TIMESTAMPDIFF(SECOND,pickup_datetime,dropoff_datetime))*3600, 1) speed
FROM mymemsqldb.yellow_trips
WHERE trip_distance > 0
AND fare_amount/trip_distance BETWEEN 2 AND 10
AND dropoff_datetime > pickup_datetime
AND pickup_datetime >= '2014-01-01' AND pickup_datetime < '2015-01-01'
GROUP BY 1
ORDER BY 1;
#1 - 7 rows in set (6.61 sec)
#2 - 7 rows in set (4.81 sec)
#3 - 7 rows in set (2.88 sec)
Query#4¶
Show the first 10 rows offset 100 in 2013 for CMT vendor, trip_distance > 10 and #passengers > 1:
SELECT *
FROM mymemsqldb.yellow_trips
WHERE trip_distance > 10
AND passenger_count > 1
AND vendor_id = 'CMT'
AND pickup_datetime >= '2013-01-01' AND pickup_datetime < '2014-01-01'
ORDER BY pickup_datetime
LIMIT 10 OFFSET 100;
#1 - 10 rows in set (4.89 sec)
#2 - 10 rows in set (0.66 sec)
#3 - 10 rows in set (0.31 sec)