PolarSPARC

Introduction to Apache Drill


Bhaskar S 10/14/2022


Overview

In any typical Enterprise, data is being continuously generated by the various systems (that are spread across) in different formats (csv, json, parquet, etc) and being persisted in different data stores (nosql, relational, s3, etc). In order to make these various data sets accessible to the data analysts, engineers, and scientists, they are run thorugh a complex set of ETL (Extract, Transform, and Load) processes for storage into a big data analytics repository (aka data lake).

Rather than running a complex ETL processing plant to move the various data sets to the Enterprise lake from the different data sources, is there an easier way for users to access and perform adhoc queries or analysis on these data sets ???

This is where Apache Drill comes in handy !!!

Apache Drill is a high-performance, low-latency, open-source distributed query engine, which enables in the exploration of data that is in either structured and/or semi-structured, using the industry standard ANSI SQL query language.

It allows one to seemlessly aggregate data from various sources (including data files - csv, json, parquet) as well as data stores (hadoop, nosql, relational, s3) for querying and analysis.

The following illustration depicts the high-level architecture for Apache Drill deployment:


High Level Architecture
Figure-1

The query engine of Apache Drill is referred to as the Drillbit. The distributed cluster of Drillbit nodes is managed through Zookeeper.

Requests from the user community (via query tools or BI tools) goes via the distributed Drillbit cluster, where one Drillbit node is choosen as the gateway for all requests and responses (referred to as the Foreman).

Drillbit will parse the user query, create a logical execution plan, run through the query optimizer, and finally convert to a physical plan for execution.

Installation and Setup

The installation will be on a Ubuntu 22.04 LTS based Linux desktop. Also, let us assume the logged in user be polarsparc with the home directory at /home/polarsparc. We will also assume that docker has been installed on the system.

We will create a directory structure as depicted below:


Top Level Directory Structure
Figure-2

To create the desired directory structure, execute the following commands:

$ mkdir -p $HOME/drill $HOME/drill/docker $HOME/drill/conf $HOME/drill/data $HOME/drill/log

$ mkdir -p $HOME/drill/data/csv/D1 $HOME/drill/data/csv/D2

$ mkdir -p $HOME/drill/data/parquet/D1 $HOME/drill/data/parquet/D2 $HOME/drill/data/parquet/D3

We will need to build a custom docker image for Apache Drill with the flexibility to run specific commands and to be able to mount volumes.

The following is the Dockerfile located in $HOME/drill/docker:


Dockerfile
# Base OS/JDK image to use
FROM adoptopenjdk/openjdk11:ubuntu-slim

# Metadata
LABEL Version="1.0" \
      Author="Bhaskar.S" \
      Email="bswamina@polarsparc.com"

# Enviornment variable(s)
ENV DRILL_VERSION=1.20.2 \
    DRILL_USER=drill \
    DRILL_GID=1000 \
    DRILL_UID=1000 \
    DRILL_HOME=/opt/drill \
    DRILL_PID_DIR=/opt/drill \
    DRILL_LOG_DIR=/opt/drill/log \
    DRILL_CONF_DIR=/opt/drill/conf \
    DRILL_DATA_DIR=/data

# Install required core dependencies
RUN apt-get -qq update && \
    apt-get install -y ca-certificates -qq && \
    apt-get install -y wget -qq && \
    apt-get clean

# Prepare Apache Drill environment - user, group, directories
RUN mkdir -p "$DRILL_HOME" "$DRILL_LOG_DIR" "$DRILL_CONF_DIR" "$DRILL_DATA_DIR" && \
    groupadd -r "$DRILL_USER" --gid="$DRILL_GID" && \
    useradd -r -g "$DRILL_USER" --uid="$DRILL_UID" -d "$DRILL_HOME" "$DRILL_USER" && \
    chown -R "$DRILL_USER":"$DRILL_USER" "$DRILL_HOME" && \
    chown -R "$DRILL_USER":"$DRILL_USER" "$DRILL_DATA_DIR"

# Install Apache Drill software
RUN wget -q "https://www.apache.org/dist/drill/$DRILL_VERSION/apache-drill-$DRILL_VERSION.tar.gz" -P /tmp && \
    tar -xzf "/tmp/apache-drill-$DRILL_VERSION.tar.gz" -C /tmp && \
    mv "/tmp/apache-drill-$DRILL_VERSION/"* "$DRILL_HOME" && \
    rm -rf "/tmp/apache-drill-$DRILL_VERSION.tar.gz" "/tmp/apache-drill-$DRILL_VERSION"

# Set the Home directory
WORKDIR $DRILL_HOME

# Expose the desired network ports
EXPOSE 8047 31010 31011 31012

# Set the PATH variable
ENV PATH=$PATH:$DRILL_HOME/bin

# Define volume mount point(s)
VOLUME $DRILL_CONF_DIR
VOLUME $DRILL_LOG_DIR
VOLUME $DRILL_DATA_DIR

To build the cutom docker image, execute the following command:

$ docker build -t bswamina/apache-drill:1.20.2-openjdk-11 .

The following would be the typical output:

Output.1

Sending build context to Docker daemon  65.42MB
Step 1/12 : FROM adoptopenjdk/openjdk11:ubuntu-slim
ubuntu-slim: Pulling from adoptopenjdk/openjdk11
fb0b3276a519: Pull complete 
6545d0570982: Pull complete 
aea04a60246f: Pull complete 
4834d3313655: Pull complete 
Digest: sha256:7fafb2739d77ab5b8a31ed13b12af7d5d0bfb90dbd88e484bd4ec8d2a34791a3
Status: Downloaded newer image for adoptopenjdk/openjdk11:ubuntu-slim
  ---> 24c3d508d46a
Step 2/12 : LABEL Version="1.0"       Author="Bhaskar.S"       Email="bswamina@polarsparc.com"
  ---> Running in d1951aa8807c
Removing intermediate container d1951aa8807c
  ---> f3275f626c56
Step 3/12 : ENV DRILL_VERSION=1.20.2     DRILL_USER=drill     DRILL_GID=1000     DRILL_UID=1000     DRILL_HOME=/opt/drill     DRILL_PID_DIR=/opt/drill     DRILL_LOG_DIR=/opt/drill/log     DRILL_CONF_DIR=/opt/drill/conf     DRILL_DATA_DIR=/data
  ---> Running in 0292d7a5e273
Removing intermediate container 0292d7a5e273
  ---> 7ee83052f5dd
Step 4/12 : RUN apt-get -qq update &&     apt-get install -y ca-certificates -qq &&     apt-get install -y wget -qq &&     apt-get clean
  ---> Running in a6f6b3664add
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package wget.
(Reading database ... 7582 files and directories currently installed.)
Preparing to unpack .../wget_1.20.3-1ubuntu2_amd64.deb ...
Unpacking wget (1.20.3-1ubuntu2) ...
Setting up wget (1.20.3-1ubuntu2) ...
Removing intermediate container a6f6b3664add
  ---> 64fe00a5160d
Step 5/12 : RUN mkdir -p "$DRILL_HOME" "$DRILL_LOG_DIR" "$DRILL_CONF_DIR" "$DRILL_DATA_DIR" &&     groupadd -r "$DRILL_USER" --gid="$DRILL_GID" &&     useradd -r -g "$DRILL_USER" --uid="$DRILL_UID" -d "$DRILL_HOME" "$DRILL_USER" &&     chown -R "$DRILL_USER":"$DRILL_USER" "$DRILL_HOME" &&     chown -R "$DRILL_USER":"$DRILL_USER" "$DRILL_DATA_DIR"
  ---> Running in a890485d08cb
Removing intermediate container a890485d08cb
  ---> 8204557d644e
Step 6/12 : RUN wget -q "https://www.apache.org/dist/drill/$DRILL_VERSION/apache-drill-$DRILL_VERSION.tar.gz" -P /tmp &&     tar -xzf "/tmp/apache-drill-$DRILL_VERSION.tar.gz" -C /tmp &&     mv "/tmp/apache-drill-$DRILL_VERSION/"* "$DRILL_HOME" &&     rm -rf "/tmp/apache-drill-$DRILL_VERSION.tar.gz" "/tmp/apache-drill-$DRILL_VERSION"
  ---> Running in e50d9a975442
Removing intermediate container e50d9a975442
  ---> 5a0bca80a5d4
Step 7/12 : WORKDIR $DRILL_HOME
  ---> Running in 61e8365ba71b
Removing intermediate container 61e8365ba71b
  ---> f99f64cd5dd3
Step 8/12 : EXPOSE 8047 31010 31011 31012
  ---> Running in 913afc603967
Removing intermediate container 913afc603967
  ---> 27e5b35764c9
Step 9/12 : ENV PATH=$PATH:$DRILL_HOME/bin
  ---> Running in b65837c9d0dd
Removing intermediate container b65837c9d0dd
  ---> 62885db6dda4
Step 10/12 : VOLUME $DRILL_CONF_DIR
  ---> Running in 0e5158b975f0
Removing intermediate container 0e5158b975f0
  ---> 1b68e4deec81
Step 11/12 : VOLUME $DRILL_LOG_DIR
  ---> Running in 92c467e06110
Removing intermediate container 92c467e06110
  ---> a9ae6e40f187
Step 12/12 : VOLUME $DRILL_DATA_DIR
  ---> Running in 5dedba421c2d
Removing intermediate container 5dedba421c2d
  ---> 4ae42602654c
Successfully built 4ae42602654c
Successfully tagged bswamina/apache-drill:1.20.2-openjdk-11

To list all the docker images, execute the following command:

$ docker images

The following would be the typical output:

Output.2

REPOSITORY               TAG                 IMAGE ID       CREATED          SIZE
bswamina/apache-drill    1.20.2-openjdk-11   104e2dbc7c32   49 seconds ago   876MB
adoptopenjdk/openjdk11   ubuntu-slim         6c88963921ad   3 days ago       372MB

To ensure the docker image was built right, execute the following command:

$ docker run --rm bswamina/apache-drill:1.20.2-openjdk-11 ls /opt/drill/bin

The following would be the typical output:

Output.3

auto-setup.sh
drill-am.sh
drillbit.sh
drill-conf
drill-config.sh
drill-embedded
drill-embedded.bat
drill-localhost
drill-on-yarn.sh
hadoop-excludes.txt
runbit
sqlline
sqlline.bat
submit_plan
yarn-drillbit.sh

One can pull the docker image from the Docker Hub repository using the following command:

docker pull bswamina/apache-drill:1.20.2-openjdk-11

Now, we shift gears to setup the configuration files and the data files.

The following is the config file logback.xml that is used to control logging by Apache Drill and is located in $HOME/drill/conf:


logback.xml
<?xml version="1.0" encoding="UTF-8" ?>
<!--

    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.

-->
<configuration>
    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
      <encoder>
        <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
        </pattern>
      </encoder>
    </appender>

    <appender name="QUERY" class="ch.qos.logback.core.rolling.RollingFileAppender">
      <file>${log.query.path}</file>
      <rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
        <fileNamePattern>${log.query.path}.%i</fileNamePattern>
        <minIndex>1</minIndex>
        <maxIndex>10</maxIndex>
      </rollingPolicy>
      <triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
        <maxFileSize>100MB</maxFileSize>
      </triggeringPolicy>
      <encoder>
        <pattern>%msg%n</pattern>
      </encoder>
    </appender>

    <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
      <file>${log.path}</file>
      <rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
        <fileNamePattern>${log.path}.%i</fileNamePattern>
        <minIndex>1</minIndex>
        <maxIndex>10</maxIndex>
      </rollingPolicy>
      <triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
        <maxFileSize>100MB</maxFileSize>
      </triggeringPolicy>
      <encoder>
        <pattern>%date{ISO8601} [%thread] %-5level %logger{36} - %msg%n</pattern>
      </encoder>
    </appender>

    <logger name="org.apache.drill" additivity="false">
      <level value="info" />
      <appender-ref ref="FILE" />
    </logger>

    <logger name="query.logger" additivity="false">
      <level value="info" />
      <appender-ref ref="QUERY" />
    </logger>

    <root>
      <level value="error" />
      <appender-ref ref="STDOUT" />
    </root>
</configuration>

Under the hood Apache Drill uses another framework called Apache Calcite. The following is the config file saffron.properties that is used to control the character set and is located in $HOME/drill/conf:


saffron.properties
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This properties file is used by Apache Calcite to define allowed charset in string literals,
# which is by default ISO-8859-1.
# Current configuration allows parsing UTF-8 by default, i.e. queries that contain utf-8 string literal.
# To take affect this file should be present in classpath.

saffron.default.charset=UTF-16LE
saffron.default.nationalcharset=UTF-16LE
saffron.default.collation.name=UTF-16LE$en_US

The following is the config file drill-override.conf that is used to configure Apache Drill related options and is located in $HOME/drill/conf:


drill-override.conf
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

drill.exec: {
  cluster-id: "dev-drillbit",
  zk.connect: "localhost:2181",
  sys.store.provider.local.path="/opt/drill/conf"
}

Finally, the following is the config file storage-plugins-override.conf that is used to configure the various storage options in Apache Drill and is located in $HOME/drill/conf:


storage-plugins-override.conf
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

"storage": {
  dfs: {
    type: "file",
    connection: "file:///",
    workspaces: {
      "data_pq": {
        "location": "/data/parquet/",
        "writable": false,
        "defaultInputFormat": "parquet",
        "allowAccessOutsideWorkspace": false
      },
      "data_csv": {
        "location": "/data/csv/",
        "writable": false,
        "defaultInputFormat": "csv",
        "allowAccessOutsideWorkspace": false
      }
    },
    formats: {
      "parquet": {
        "type": "parquet"
      },
      "csv": {
        "type": "text",
        "extensions": [
          "csv"
        ],
        "extractHeader": true,
        "delimiter": ","
      }
    },
    enabled: true
  }
}

For our demonstration, we will only leverage the data sets stored in csv and parquet formats in the local filesystem (the dfs storage).

We will copy few csv and parquet files into the appropriate locations in $HOME/drill/data.


!!! INFORMATIONAL !!!

The data sets used in this article are from the following two sources:

1. Palmer Penguins
2. Device Monitoring

The following illustration depicts the various data files in the different directories:


Data Files
Figure-3

Now, it is time to copy the following files into the appropriate destination as indicated below:

At this point, we are done with all the required setup.

To start Apache Drill in the embedded mode (single node mode), execute the following command:

$ docker run -it --rm --name drill -u $(id -u ${USER}):$(id -g ${USER}) -p 8047:8047 -p 31010:31010 -v /home/polarsparc/drill/conf:/opt/drill/conf -v /home/polarsparc/drill/log:/opt/drill/log -v /home/polarsparc/drill/data:/data bswamina/apache-drill:1.20.2-openjdk-11 /opt/drill/bin/drill-embedded

The following would be the typical output:

Output.4

"A Drill is a terrible thing to waste."
apache drill>

Notice the prompt changes to apache drill>.

Hands-on with Drill

To ensure Drill has started properly in the embedded mode, execute the following command:

apache drill> select version from sys.version;

The following would be the typical output:

Output.5

+---------+
| version |
+---------+
| 1.20.2  |
+---------+
1 row selected (1.326 seconds)

We will begin by accessing the penguins parquet file via Apache Drill. Parquet files are stored locally on the filesystem. This means we use the dfs storage plugin. Also, the parquet files are stored in the directory $HOME/drill/data/parquet which is defined under the workspace of data_pq. So, the schema to use is dfs.data_pq.

To switch to the schema dfs.data_pq, execute the following command:

apache drill> use dfs.data_pq;

The following would be the typical output:

Output.6

+------+-----------------------------------------+
|  ok  |                 summary                 |
+------+-----------------------------------------+
| true | Default schema changed to [dfs.data_pq] |
+------+-----------------------------------------+
1 row selected (0.249 seconds)
apache drill (dfs.data_pq)>

Notice the prompt changes to apache drill (dfs.data_pq)>.

To display all the file systems visible to drill for query purposes, execute the following command:

apache drill (dfs.data_pq)> show files;

The following would be the typical output:

Output.7

+------+-------------+--------+--------+-------+-------+-------------+-------------------------+-------------------------+
| name | isDirectory | isFile | length | owner | group | permissions |       accessTime        |    modificationTime     |
+------+-------------+--------+--------+-------+-------+-------------+-------------------------+-------------------------+
| D1   | true        | false  | 4096   | drill | drill | rwxrwxr-x   | 2022-10-14 19:16:13.325 | 2022-10-14 19:33:12.247 |
| D3   | true        | false  | 4096   | drill | drill | rwxrwxr-x   | 2022-10-14 19:18:49.013 | 2022-10-14 19:33:13.585 |
| D2   | true        | false  | 4096   | drill | drill | rwxrwxr-x   | 2022-10-14 19:21:42.013 | 2022-10-14 19:33:14.301 |
+------+-------------+--------+--------+-------+-------+-------------+-------------------------+-------------------------+
3 rows selected (0.152 seconds)

To display all the schemas available to drill for access, execute the following command:

apache drill (dfs.data_pq)> show schemas;

The following would be the typical output:

Output.8

+--------------------+
|    SCHEMA_NAME     |
+--------------------+
| dfs.data_csv       |
| dfs.data_pq        |
| dfs.default        |
| information_schema |
| sys                |
| cp.default         |
+--------------------+
6 rows selected (0.308 seconds)

To display the count of records in the penguins parquet file, execute the following command:

apache drill (dfs.data_pq)> select count(*) as `Count` from dfs.data_pq.`/D1/*.parquet`;

The following would be the typical output:

Output.9

+-------+
| Count |
+-------+
| 342   |
+-------+
1 row selected (0.777 seconds)

!!! ATTENTION !!!

There is a weird bug in Apache Drill. Trying to execute the following SQL results in an error:

select count(*) from dfs.data_pq.`/D1/penguins.parquet`;

The generated error is:

Error: VALIDATION ERROR: From line 1, column 22 to line 1, column 24: Object '/D1/penguins.parquet' not found
[Error Id: 5618fe46-4615-4169-b651-55e4b007293a ] (state=,code=0)

To display all the penguin species and their body mass from the penguins parquet file where their body mass is greater than or equal to 5750, execute the following command:

apache drill (dfs.data_pq)> select `species`, `body_mass_g` from dfs.data_pq.`/D1/*.parquet` where `body_mass_g` >= 5750 order by `body_mass_g`;

The following would be the typical output:

Output.10

+---------+-------------+
| species | body_mass_g |
+---------+-------------+
| Gentoo  | 5750.0      |
| Gentoo  | 5800.0      |
| Gentoo  | 5800.0      |
| Gentoo  | 5850.0      |
| Gentoo  | 5850.0      |
| Gentoo  | 5850.0      |
| Gentoo  | 5950.0      |
| Gentoo  | 5950.0      |
| Gentoo  | 6000.0      |
| Gentoo  | 6000.0      |
| Gentoo  | 6050.0      |
| Gentoo  | 6300.0      |
+---------+-------------+
12 rows selected (0.563 seconds)

To display the average body mass grouped by the penguin species from the penguins parquet file, execute the following command:

apache drill (dfs.data_pq)> select p.species, avg(p.body_mass_g) as `mean_body_mass_g` from dfs.data_pq.`/D1/*.parquet` as p group by p.species;

The following would be the typical output:

Output.11

+-----------+--------------------+
|  species  |  mean_body_mass_g  |
+-----------+--------------------+
| Adelie    | 3700.662251655629  |
| Gentoo    | 5076.016260162602  |
| Chinstrap | 3733.0882352941176 |
+-----------+--------------------+
3 rows selected (0.215 seconds)

Moving on to the more complex case of joining multiple sources (two parquet files from the device monitoring data set).

To display the time, the model, the cpu usage, and the memory usage on devices from the device monitoring parquet files for a limit of 10 rows, execute the following command:

apache drill (dfs.data_pq)> select `b`.`time`, `a`.`model`, `b`.`cpu_avg_5min`, `b`.`mem_used` from `/D2/*.parquet` as `a` inner join `/D3/*.parquet` as `b` on `a`.`device_id` = `b`.`device_id` limit 10;

The following would be the typical output:

Output.12

+------------------------+---------+--------------+-----------+
|          time          |  model  | cpu_avg_5min | mem_used  |
+------------------------+---------+--------------+-----------+
| 2016-11-15 07:00:00-05 | pinto   | 10.802       | 589988922 |
| 2016-11-15 07:00:00-05 | mustang | 8.106        | 279257668 |
| 2016-11-15 07:00:00-05 | focus   | 8.2          | 350418942 |
| 2016-11-15 07:00:00-05 | pinto   | 5.594        | 450422930 |
| 2016-11-15 07:00:00-05 | mustang | 5.13         | 370403299 |
| 2016-11-15 07:00:00-05 | focus   | 7.572        | 590376778 |
| 2016-11-15 07:00:00-05 | pinto   | 7.362        | 350314519 |
| 2016-11-15 07:00:00-05 | mustang | 12.4         | 590887970 |
| 2016-11-15 07:00:00-05 | focus   | 7.232        | 589266926 |
| 2016-11-15 07:00:00-05 | pinto   | 11.014       | 460684425 |
+------------------------+---------+--------------+-----------+
10 rows selected (3.179 seconds)

!!! ATTENTION !!!

There is a weird bug in Apache Drill. Trying to execute the following SQL results in an error:

select b.time, a.model, b.cpu_avg_5min, b.mem_used from `/D2/*.parquet` as a inner join `/D3/*.parquet` as b on a.device_id = b.device_id limit 10;

The generated error is:

Error: PARSE ERROR: Encountered "." at line 1, column 9.
SQL Query: select b.time, a.model, b.cpu_avg_5min, b.mem_used from `/D2/*.parquet` as a inner join `/D3/*.parquet` as b on a.device_id = b.device_id limit 10
                   ^
[Error Id: 7e5a87f7-9dbf-4faf-b291-5e86cc26c0d7 ] (state=,code=0)

It is now time to switch gears into the csv space. To switch to the schema dfs.data_csv, execute the following command:

apache drill> use dfs.data_csv;

The following would be the typical output:

Output.13

+------+------------------------------------------+
|  ok  |                 summary                  |
+------+------------------------------------------+
| true | Default schema changed to [dfs.data_csv] |
+------+------------------------------------------+
1 row selected (0.132 seconds)
apache drill (dfs.data_csv)>

Notice the prompt changes to apache drill (dfs.data_csv)>.

To display all the file systems visible to drill for query purposes, execute the following command:

apache drill (dfs.data_csv)> show files;

The following would be the typical output:

Output.14

+------+-------------+--------+--------+-------+-------+-------------+-------------------------+-------------------------+
| name | isDirectory | isFile | length | owner | group | permissions |       accessTime        |    modificationTime     |
+------+-------------+--------+--------+-------+-------+-------------+-------------------------+-------------------------+
| D1   | true        | false  | 4096   | drill | drill | rwxrwxr-x   | 2022-10-14 19:42:16.223 | 2022-10-14 19:42:16.319 |
| D2   | true        | false  | 4096   | drill | drill | rwxrwxr-x   | 2022-10-14 19:42:17.738 | 2022-10-14 19:42:17.794 |
+------+-------------+--------+--------+-------+-------+-------------+-------------------------+-------------------------+
2 rows selected (0.126 seconds)

To display the penguin species and their body mass from the penguins csv file and limit the number of records to 10, execute the following command:

apache drill (dfs.data_csv)> select `species`, `body_mass_g` from `/D1/device_info.csv` limit 10;

The following would be the typical output:

Output.15

+---------+-------------+
| species | body_mass_g |
+---------+-------------+
| Gentoo  | 5750.0      |
| Gentoo  | 5800.0      |
| Gentoo  | 5800.0      |
| Gentoo  | 5850.0      |
| Gentoo  | 5850.0      |
| Gentoo  | 5850.0      |
| Gentoo  | 5950.0      |
| Gentoo  | 5950.0      |
| Gentoo  | 6000.0      |
| Gentoo  | 6000.0      |
| Gentoo  | 6050.0      |
| Gentoo  | 6300.0      |
+---------+-------------+
12 rows selected (0.563 seconds)

!!! ATTENTION !!!

Notice how one is able to refer to the actual data file with csv file(s). This is an issue with the parquet file(s)

Now, for the final act of joining multiple sources (a csv file and a parquet file from the device monitoring data set).

To display the time, the model, the cpu usage, and the memory usage on devices from the device info parquet file and the device readings csv file for a limit of 10 rows, execute the following command:

apache drill (dfs.data_csv)> select `b`.`time`, `a`.`model`, `b`.`cpu_avg_5min`, `b`.`mem_used` from dfs.`/data/parquet/D2/device_info.parquet` as `a` inner join dfs.`/data/csv/D2/device_readings.csv` as `b` on `a`.`device_id` = `b`.`device_id` limit 10;

The following would be the typical output:

Output.16

+------------------------+---------+--------------+-----------+
|          time          |  model  | cpu_avg_5min | mem_used  |
+------------------------+---------+--------------+-----------+
| 2016-11-15 07:00:00-05 | pinto   | 10.802       | 589988922 |
| 2016-11-15 07:00:00-05 | mustang | 8.106        | 279257668 |
| 2016-11-15 07:00:00-05 | focus   | 8.2          | 350418942 |
| 2016-11-15 07:00:00-05 | pinto   | 5.594        | 450422930 |
| 2016-11-15 07:00:00-05 | mustang | 5.13         | 370403299 |
| 2016-11-15 07:00:00-05 | focus   | 7.572        | 590376778 |
| 2016-11-15 07:00:00-05 | pinto   | 7.362        | 350314519 |
| 2016-11-15 07:00:00-05 | mustang | 12.4         | 590887970 |
| 2016-11-15 07:00:00-05 | focus   | 7.232        | 589266926 |
| 2016-11-15 07:00:00-05 | pinto   | 11.014       | 460684425 |
+------------------------+---------+--------------+-----------+
10 rows selected (0.48 seconds)

One can access the web interface of Apache Drill by launching the url http://localhost:8047 in a web browser.

The following illustration depicts the Apache Drill web interface:


Web Interface
Figure-4

Clicking on the Storage option from the navigation bar will land one in the storage options page as shown in the illustration below:


Storage Options
Figure-5

Clicking on the Query option from the navigation bar will land one in the query execution page as shown in the illustration below:


Query Execution
Figure-6

To quit from the drill command-line query interface, execute the following command:

apache drill (dfs.data_csv)> !quit

The following is the link to the Github Repo that provides the config and data files used in this article:


References

Apache Drill



© PolarSPARC