You want to migrate an Apache Spark 3 batch job from on-premises to Google Cloud. You need to minimally change the job so that it reads from Cloud Storage and writes its results to BigQuery. The job is optimized for Spark: each executor has 8 vCPUs and 16 GB of memory, and you want to be able to choose similar settings. You want to minimize the installation and management effort required to run the job. What should you do?
Correct Answer: A
The key requirements are:
* Migrate Spark 3 batch job.
* Minimally change the job (reads from GCS, writes to BQ - standard for Spark on GCP).
* Job optimized for Spark (specific executor vCPU/memory).
* Ability to choose similar executor settings.
* Minimize installation and management effort.
Dataproc Serverless (Option A) is designed for exactly these use cases.
* Spark 3 Support: Dataproc Serverless supports various Spark runtimes, including Spark 3.
* Minimal Changes: Spark jobs reading from GCS and writing to BigQuery (using the Spark-BigQuery connector) are standard. Minimal code changes are generally needed; see the sketch after this list.
* Customizable Resources: Dataproc Serverless allows you to specify resources for the driver and executors, including vCPU and memory. You can configure these to match your optimized on-premises settings (e.g., 8 vCPUs and 16 GB of memory per executor, though the specific available configurations should be checked); see the submission sketch after the references.
* Minimal Installation and Management: This is the core benefit of "serverless." You don't need to provision, manage, or scale clusters. You submit your batch job, and Google Cloud handles the underlying infrastructure, which significantly reduces operational overhead.
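To make "minimal changes" concrete, here is a minimal PySpark sketch of such a job; the bucket, dataset, and table names are hypothetical, and recent Dataproc Serverless runtimes bundle the Spark-BigQuery connector, so no extra installation is assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-to-bq-batch").getOrCreate()

# Read the input from Cloud Storage; the gs:// path and Parquet format
# are placeholders for whatever the existing job already reads.
df = spark.read.parquet("gs://example-input-bucket/events/")

# ...the job's existing Spark transformations stay as-is...
result = df.groupBy("event_type").count()

# Write the result to BigQuery via the Spark-BigQuery connector, which
# by default stages data in a temporary Cloud Storage bucket and then
# loads it into the target table.
(
    result.write.format("bigquery")
    .option("temporaryGcsBucket", "example-temp-bucket")
    .mode("overwrite")
    .save("example_dataset.event_counts")
)
```

The temporaryGcsBucket option covers the connector's default indirect write path; the connector also offers a direct write method that skips the staging bucket.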
Let's analyze why the other options are less suitable:
* B (Compute Engine VM): You would need to manually install Spark, configure it, manage dependencies, and maintain the VM itself. This is high management effort.
* C (Google Kubernetes Engine cluster): While you can run Spark on GKE (e.g., using the Spark on Kubernetes operator), it involves managing the GKE cluster, Spark deployment configurations, Docker images, etc. This is also significant management effort, more than Dataproc Serverless.
* D (Dataproc cluster): A standard Dataproc cluster provides more control than serverless but also requires more management (creating, scaling, and managing the cluster lifecycle). Dataproc Serverless is specifically designed to minimize this management for batch jobs. Given the "minimize installation and management effort" requirement, serverless is preferred over a managed cluster when it meets the job's needs.
Reference:
* Google Cloud Documentation: Dataproc Serverless > Overview. "Dataproc Serverless for Spark lets you run Spark batch workloads without requiring you to provision and manage your own cluster... Submit your Spark workload to the Dataproc Serverless service. The service will run the workload on a managed compute infrastructure, autoscaling resources as needed."
* Google Cloud Documentation: Dataproc Serverless > Submitting Spark batch workloads > Spark batch workload properties. This page shows how to specify driver and executor cores (spark.driver.cores, spark.executor.cores) and memory (spark.driver.memory, spark.executor.memory), allowing you to choose settings similar to your existing optimized job.
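As a sketch of the submission side, these properties can be set through the batch's runtime configuration with the google-cloud-dataproc client library (the project, region, bucket, and batch names below are hypothetical; the same properties can also be passed on the command line with gcloud dataproc batches submit):

```python
from google.cloud import dataproc_v1

# Hypothetical project and region for illustration.
project_id = "example-project"
region = "us-central1"

# The Batch API endpoint is regional.
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://example-bucket/gcs_to_bq_batch.py",
    ),
    # Match the on-premises executor shape via standard Spark properties.
    # Verify the currently supported vCPU/memory combinations against the
    # Dataproc Serverless documentation.
    runtime_config=dataproc_v1.RuntimeConfig(
        properties={
            "spark.executor.cores": "8",
            "spark.executor.memory": "16g",
        }
    ),
)

# Submitting the batch is the only operational step: no cluster to
# provision, scale, or tear down.
operation = client.create_batch(
    parent=f"projects/{project_id}/locations/{region}",
    batch=batch,
    batch_id="spark-migration-batch",
)
print(operation.result().state)
```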