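You created an analytics environment on Google Cloud so that your data scientist team can explore data without impacting the on-premises Apache Hadoop solution. The data in the on-premises Hadoop Distributed File System (HDFS) cluster is in Optimized Row Columnar (ORC) formatted files with multiple columns of Hive partitioning. The data scientist team needs to be able to explore the data in the same way they do on the on-premises HDFS cluster, using SQL on the Hive query engine. You need to choose the most cost-effective storage and processing solution. What should you do?
Correct answer: D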
The requirements are:
* Explore ORC formatted files with Hive partitioning.
* Mimic the SQL on Hive query engine experience.
* Cost-effective storage and processing.
* Avoid impacting the on-premises Hadoop solution.
Let's analyze the options:
* Option A (Import to Bigtable): Bigtable is a NoSQL database. It is not suited to SQL-based exploration of ORC files and has no direct notion of Hive-style partitioning. Using it would require significant data transformation and a different query paradigm, so it is not cost-effective for this use case.
* Option B (Import to BigQuery native tables): BigQuery can load ORC files into native storage, and native tables provide excellent query performance. However, this adds an ETL (import) step and BigQuery storage costs, which can exceed the cost of keeping the data in its original format on Cloud Storage when queries are exploratory and do not frequently touch all of the data.
* Option C (Copy to Cloud Storage, deploy Dataproc): Dataproc runs Hadoop and Spark (and thus Hive) clusters on Google Cloud, so it offers a very similar experience ("SQL on the Hive query engine"), and storage on Cloud Storage is cost-effective. However, a persistent Dataproc cluster incurs compute costs for its nodes even when nobody is querying. Ephemeral clusters are possible, but they add operational overhead for exploratory queries.
* Option D (Copy to Cloud Storage, create external BigQuery tables): This is often the most cost-effective and straightforward solution for this scenario.
  * Cost-effective storage: Cloud Storage is a low-cost option for storing files such as ORC.
  * SQL interface: BigQuery provides a familiar SQL interface.
  * External tables: BigQuery can query data directly in Cloud Storage (including ORC files) through external tables. This avoids loading the data into BigQuery's managed storage, saving storage costs and ETL effort.
  * Hive partitioning: BigQuery external tables support Hive partitioning layouts. When you define the external table, you specify the partitioning scheme, and BigQuery prunes partitions so that queries filtering on partition keys scan only the relevant partitions. This improves performance, reduces cost, and closely mirrors the Hive experience.
  * Processing cost: With on-demand pricing, you pay only for the data scanned by each query, which fits exploratory analysis.
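The partition pruning mentioned in the list above can be illustrated without BigQuery at all. The following is a minimal Python sketch, not BigQuery's actual implementation: it parses Hive-style `key=value` path segments and keeps only the objects whose partition values match a filter. The bucket name, the partition columns (`year`, `month`), and the file names are hypothetical.

```python
from typing import Dict, List


def parse_hive_partitions(uri: str, prefix: str) -> Dict[str, str]:
    """Extract key=value partition segments that follow the common prefix."""
    relative = uri[len(prefix):].strip("/")
    partitions = {}
    for segment in relative.split("/")[:-1]:  # last segment is the file name
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions


def prune(uris: List[str], prefix: str, wanted: Dict[str, str]) -> List[str]:
    """Keep only files whose partition values match every filter in `wanted`."""
    kept = []
    for uri in uris:
        parts = parse_hive_partitions(uri, prefix)
        if all(parts.get(k) == v for k, v in wanted.items()):
            kept.append(uri)
    return kept


# Hypothetical layout, as it might look after copying the ORC files from HDFS.
PREFIX = "gs://example-bucket/sales"
FILES = [
    f"{PREFIX}/year=2023/month=12/part-0000.orc",
    f"{PREFIX}/year=2024/month=01/part-0000.orc",
    f"{PREFIX}/year=2024/month=02/part-0000.orc",
]

# Equivalent in spirit to: WHERE year = '2024' AND month = '01'
selected = prune(FILES, PREFIX, {"year": "2024", "month": "01"})
print(selected)
```

Only one of the three files survives the filter, so only that file would need to be read. This is why queries that filter on partition keys scan less data.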
Comparing D with B: External tables are generally cheaper for storage and initial setup when the data is already in ORC and an ETL process into BigQuery native storage is to be avoided. Queries may run somewhat slower than on native tables, but performance is often sufficient for exploration, especially when queries filter on partition keys.
Comparing D with C: BigQuery external tables are serverless, meaning no cluster to manage or pay for when idle. Dataproc requires managing and paying for a cluster. For exploration, the serverless nature of BigQuery is usually more cost-effective.
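The trade-off between an always-on cluster and pay-per-scan queries can be made concrete with a rough break-even calculation. This is a simplified sketch under stated assumptions: every rate and volume below is a made-up placeholder, not a real Google Cloud price, and storage costs (identical in options C and D) are ignored.

```python
def monthly_cluster_cost(node_count: int, hourly_rate_per_node: float,
                         hours_running: float) -> float:
    """Cost of a cluster that bills for every hour it is up, busy or idle."""
    return node_count * hourly_rate_per_node * hours_running


def monthly_scan_cost(tib_scanned: float, rate_per_tib: float) -> float:
    """Cost of a pay-per-scan engine: nothing is billed while nobody queries."""
    return tib_scanned * rate_per_tib


def break_even_tib(cluster_cost: float, rate_per_tib: float) -> float:
    """Monthly scan volume at which both models cost the same."""
    return cluster_cost / rate_per_tib


# Placeholder numbers for illustration only; NOT real Google Cloud prices.
cluster = monthly_cluster_cost(node_count=4, hourly_rate_per_node=0.5,
                               hours_running=720)
scans = monthly_scan_cost(tib_scanned=20, rate_per_tib=5.0)
print(cluster, scans, break_even_tib(cluster, rate_per_tib=5.0))
```

With these placeholder inputs, the pay-per-scan model stays cheaper until the monthly scan volume reaches the break-even point. Substituting current list prices and a realistic workload would show where that point falls for a given team.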
Therefore, copying ORC files to Cloud Storage and using BigQuery external tables is the most cost-effective solution that meets all requirements.
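Once the ORC files are in Cloud Storage, the external table for option D could be defined with a single DDL statement. The following Python sketch only builds that statement as a string, using BigQuery's `CREATE EXTERNAL TABLE ... WITH PARTITION COLUMNS` syntax with the `format`, `uris`, and `hive_partition_uri_prefix` options. The project, dataset, table, and bucket names are hypothetical, and the helper `build_external_table_ddl` is introduced here purely for illustration.

```python
def build_external_table_ddl(table: str, bucket_prefix: str) -> str:
    """Build DDL for a Hive-partitioned external table over ORC files.

    `table` is a fully qualified `project.dataset.table` name and
    `bucket_prefix` is the gs:// path directly above the first key=value
    directory. Both are supplied by the caller; nothing is validated here.
    """
    prefix = bucket_prefix.rstrip("/")
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        "WITH PARTITION COLUMNS\n"  # no column list: keys are inferred from paths
        "OPTIONS (\n"
        "  format = 'ORC',\n"
        f"  uris = ['{prefix}/*'],\n"
        f"  hive_partition_uri_prefix = '{prefix}'\n"
        ");"
    )


# Hypothetical names; replace them with the real project and bucket.
ddl = build_external_table_ddl(
    "my-project.analytics.sales_orc", "gs://example-bucket/sales/"
)
print(ddl)
```

The generated statement would then be run in the BigQuery console or through a client of choice. Data scientists can afterwards filter on the inferred partition columns in ordinary SQL, much as they would in Hive.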
Reference:
* Google Cloud Documentation: BigQuery > External data sources > Querying Cloud Storage data. "You can query data in Cloud Storage by using external tables or federated queries. External tables are tables that read data directly from files in Cloud Storage."
* Google Cloud Documentation: BigQuery > External data sources > Supported formats and compression types. ORC is a supported format.
* Google Cloud Documentation: BigQuery > Creating and using tables > Creating external tables. "External tables let you query data stored in Cloud Storage as if it were a standard BigQuery table. You can use external tables to query data in various formats, including... ORC..."
* Google Cloud Documentation: BigQuery > Creating and using tables > Querying partitioned external tables. "You can create an external table that is partitioned on Hive partitioning keys. When you query a Hive partitioned external table, BigQuery performs partition pruning to skip reading unnecessary partitions." This directly addresses the "Hive partitioning" and "explore data in a similar way" requirements.
Google Cloud Blog: "Choosing the right data processing option on GCP: BigQuery vs. Dataproc" (and similar articles) often highlight BigQuery external tables as a cost-effective way to query data in place on Cloud Storage, especially for data lake scenarios.