Spark Fundamentals II Quiz Answers


Expand your knowledge of the concepts discussed in Spark Fundamentals I with a focus on RDDs (Resilient Distributed Datasets). RDDs are the main abstraction Spark provides to enable parallel processing across the nodes of a Spark cluster.

  • Get in-depth knowledge of Spark’s architecture and how data is distributed and tasks are parallelized.
  • Learn how to optimize your data for joins using Spark’s memory caching.
  • Learn how to use the more advanced operations available in the API.
  • The lab exercises for this course are performed exclusively on the Cloud and using a Notebook interface.

IBM Data Science Experience provides you with Jupyter notebooks that are already connected to Spark and support Python, R, and Scala, so that you can start creating your Spark projects and collaborating with other data scientists. When you sign up, you get free access to Data Science Experience and all other IBM services for 30 days.


Module 1: Introduction to Notebooks

Question: Which of the following statements about Zeppelin Notebook is NOT true?

  • Zeppelin is open-source.
  • With Zeppelin, you can run code and create visualizations through a web interface.
  • Zeppelin comes configured with Scala, Spark, and Julia.
  • Zeppelin is an interactive data analytics tool started by NFLabs.

Question: Jupyter Notebook and Data Scientist Workbench are both open-source projects. True or false?

  • False
  • True

Question: Which notebook will you use in the lab section of this course?

  • Databricks
  • Zeppelin
  • Watson Studio
  • Jupyter Notebook

Module 2: RDD Architecture

Question: Which of the following statements is NOT true?

  • Partitioning is what enables parallel execution of Spark jobs.
  • An RDD is made up of multiple partitions.
  • Spark normally determines the number of partitions based on the size of the hard drives in your cluster.
  • Spark is able to read from many different data stores in addition to HDFS, including the local file system and cloud services like Cloudant, AWS, Google, and Azure.

Question: In the example of an RDD with 3 partitions and no partitioner, which of the following statements is true?

  • It is better not to partition an RDD if you need to join it multiple times.
  • Joining RDDs with no partitioner will cause each executor to shuffle all values with the same key to a single machine.
  • Repeatedly joining on the same RDD is highly efficient.
  • The keys are co-located.
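The partitioner question above can be made concrete with a small sketch. It assumes a local Spark setup; the object name `PartitionedJoinSketch`, the helper `prePartition`, and the sample data are hypothetical and only serve the illustration. It shows that a freshly parallelized pair RDD has no partitioner, and that hash-partitioning it once gives later joins a layout they can reuse.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionedJoinSketch {
  // Hash-partition a pair RDD once and keep it cached,
  // so repeated joins can reuse the same key layout.
  def prePartition(rdd: RDD[(String, Int)], numPartitions: Int): RDD[(String, Int)] =
    rdd.partitionBy(new HashPartitioner(numPartitions)).persist()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partitioned-join").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val users  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), 3)
      val events = sc.parallelize(Seq(("a", 10), ("b", 20), ("a", 30)), 3)

      println(users.partitioner)        // no partitioner on the raw RDD
      val partitioned = prePartition(users, 3)
      println(partitioned.partitioner)  // a HashPartitioner is now attached

      // Only `events` needs to be shuffled to line up with `partitioned`.
      partitioned.join(events).collect().foreach(println)
    } finally sc.stop()
  }
}
```

Whether the extra `partitionBy` pays off depends on how often the RDD is joined; for a single join it is just one more shuffle.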

Question: Speculative execution handles slow tasks by re-launching them as necessary. True or false?

  • False
  • True
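Speculative execution is switched on through configuration. The fragment below is a sketch only: the property names are standard Spark settings, but the object name and the chosen values are illustrative assumptions, not recommendations from the course.

```scala
import org.apache.spark.SparkConf

object SpeculationConfSketch {
  val conf: SparkConf = new SparkConf()
    .setAppName("speculation-sketch")
    .set("spark.speculation", "true")            // disabled unless set explicitly
    .set("spark.speculation.multiplier", "1.5")  // how much slower than the median counts as slow
    .set("spark.speculation.quantile", "0.75")   // share of tasks that must finish before checking
}
```

With these settings Spark may launch a second copy of a slow task and use whichever copy finishes first.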

Module 3: Optimizing Transformations and Actions

Question: Which of the following statements is true?

  • MapValues applies a map function to each value and performs repartitioning.
  • GroupByKey groups all values by key from all partitions into memory.
  • GroupByKey shuffles everything and it operates efficiently on large datasets.
  • CountByKey is designed to be used in production.

Question: AggregateByKey is better than GroupByKey when we want to calculate the average value for each key in an RDD. True or false?

  • False
  • True
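As an illustration of the per-key average mentioned in the question above, here is a minimal sketch using `aggregateByKey`. The object name `AverageByKeySketch`, the helper `averageByKey`, and the sample data are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object AverageByKeySketch {
  // Build a (sum, count) pair per key inside each partition first,
  // then merge those small pairs across partitions.
  def averageByKey(rdd: RDD[(String, Double)]): RDD[(String, Double)] =
    rdd.aggregateByKey((0.0, 0L))(
      (acc, v) => (acc._1 + v, acc._2 + 1L),  // fold one value into the partition-local accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2)    // merge accumulators coming from different partitions
    ).mapValues { case (sum, count) => sum / count }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avg-by-key").setMaster("local[2]"))
    try {
      val scores = sc.parallelize(Seq(("a", 1.0), ("a", 5.0), ("b", 6.0)), 2)
      averageByKey(scores).collect().foreach(println)
    } finally sc.stop()
  }
}
```

A `groupByKey` version would move every individual value across the network before averaging, whereas this version moves only one `(sum, count)` pair per key and partition.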

Question: Which of the following statements is NOT true?

  • MapValues tells Spark that the hashed keys will remain in their partitions and we can keep the same partitioner across operations.
  • In the example of a pair RDD with 2 partitions, running a map operation over all records will leave the keys of each record unchanged.
  • AggregateByKey splits the calculation into two steps. Only one pair per key, per partition is shuffled.
  • GroupByKey causes a shuffle of all values across the network, even if they are already co-located within a partition.
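The difference between `map` and `mapValues` with respect to the partitioner can be checked directly. This is a small sketch under the assumption of a local Spark context; `MapValuesSketch` and its helper names are hypothetical.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MapValuesSketch {
  // mapValues cannot touch the keys, so Spark keeps the existing partitioner.
  def scaledWithMapValues(rdd: RDD[(String, Int)]): RDD[(String, Int)] =
    rdd.mapValues(_ * 10)

  // map could change the keys, so Spark drops the partitioner even if it does not.
  def scaledWithMap(rdd: RDD[(String, Int)]): RDD[(String, Int)] =
    rdd.map { case (k, v) => (k, v * 10) }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("map-values").setMaster("local[2]"))
    try {
      val byKey = sc.parallelize(Seq(("a", 1), ("b", 2)), 2)
        .partitionBy(new HashPartitioner(2))
      println(scaledWithMapValues(byKey).partitioner)  // partitioner is preserved
      println(scaledWithMap(byKey).partitioner)        // partitioner is lost
    } finally sc.stop()
  }
}
```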

Module 4: Caching and Serialization

Question: Which of the following statements is true?

  • When you no longer need the persisted RDD, Spark will automatically make room for new RDDs.
  • Persisting to disk would allow us to reconstitute the RDD in the event a partition is lost, instead of re-computing all the expensive operations for the lost partitions.
  • Ideally we want to persist before any pruning, filtering, or other transformations needed for downstream processing.
  • Persisting RDDs can help us save time re-computing partitions, and persistence is in-memory only.
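To illustrate the persistence options discussed above, the sketch below cleans an RDD first and then persists the result with a storage level that can spill to disk. The object name `PersistSketch`, the helper `cleanAndPersist`, and the sample lines are assumptions for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Prune and filter first, then persist the smaller result that later jobs reuse.
  def cleanAndPersist(lines: RDD[String]): RDD[String] =
    lines.map(_.trim).filter(_.nonEmpty).persist(StorageLevel.MEMORY_AND_DISK)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("persist").setMaster("local[2]"))
    try {
      val cleaned = cleanAndPersist(sc.parallelize(Seq(" a ", "", "b", "  "), 2))
      println(cleaned.count())             // first action computes and stores the partitions
      println(cleaned.distinct().count())  // later jobs read the stored partitions
      cleaned.unpersist()                  // release the storage explicitly when done
    } finally sc.stop()
  }
}
```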

Question: Which of the following statements is NOT true?

  • Serialization has the added benefit of helping with garbage collection, as you’ll be storing 1 object versus many small objects for each record.
  • The records of an RDD will be stored as one large byte array.
  • There is almost no CPU usage to deserialize the data.
  • Serialization helps by saving space that persisting RDDs occupy in memory.

Question: The Java serializer can store the entire RDD in less space than the original file. True or false?

  • False
  • True

Module 5: Developing and Testing

Question: Which of the following statements is true?

  • We cannot use sbt for an Eclipse project.
  • We cannot create builds directly from the console using sbt.
  • sbt automatically finds source and library files using a conventional directory structure.
  • Maven is more powerful and customizable than sbt.

Question: IntelliJ fully supports sbt build files with no conversions required. True or false?

  • False
  • True

Question: Which of the following statements is NOT correct during unit testing?

  • The spark-testing-base package is handy for testing.
  • We want to test the code that is actually used in our application.
  • We should not use unit testing tools like scalatest.
  • We should put transformations for a given RDD in its own object or class.
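One way to read the advice about keeping transformations in their own object is sketched below. The names `WordTransforms` and `WordTransformsCheck` are hypothetical, and plain `assert` is used only to keep the example dependency-free; in a real project the same check would typically live in a scalatest suite.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// The transformations live in their own object, so the application
// and the tests exercise exactly the same code.
object WordTransforms {
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .reduceByKey(_ + _)
}

object WordTransformsCheck {
  def main(args: Array[String]): Unit = {
    // A local master keeps the check self-contained and quick to run.
    val sc = new SparkContext(new SparkConf().setAppName("word-check").setMaster("local[2]"))
    try {
      val input = sc.parallelize(Seq("to be or", "not to be"))
      val counts = WordTransforms.countWords(input).collectAsMap()
      assert(counts("to") == 2)
      assert(counts("or") == 1)
    } finally sc.stop()
  }
}
```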

Final Exam

Question: Which of the following web-based notebooks is built around Jupyter and IPython?

  • Data Scientist Workbench
  • Spark Notebook
  • Databricks Cloud
  • Zeppelin

Question: What defines a stage boundary?

  • Repartition
  • Action
  • Transformation
  • Shuffle dependency

Question: What does RDD stand for?

  • Resilient Distributed Dataset
  • Reusable Distributed Dataset
  • Reusable Data Directory
  • None of the above

Question: Coalesce can reduce the number of partitions without causing a shuffle. True or false?

  • True
  • False
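The behaviour of `coalesce` can be inspected on a small RDD. The sketch below assumes a local Spark context; `CoalesceSketch` and its helpers are hypothetical names. It compares `coalesce`, which by default merges partitions in place, with `repartition`, which always shuffles.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CoalesceSketch {
  // coalesce defaults to shuffle = false and merges existing partitions.
  def shrink(rdd: RDD[Int], n: Int): RDD[Int] = rdd.coalesce(n)

  // repartition is coalesce with shuffle = true.
  def reshuffle(rdd: RDD[Int], n: Int): RDD[Int] = rdd.repartition(n)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("coalesce").setMaster("local[2]"))
    try {
      val wide = sc.parallelize(1 to 100, 8)
      println(shrink(wide, 2).partitions.length)
      println(shrink(wide, 2).toDebugString)     // lineage without a shuffle stage
      println(reshuffle(wide, 2).toDebugString)  // lineage that includes a shuffle
    } finally sc.stop()
  }
}
```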

Question: Which operation should you use to map the values in a pair RDD without affecting the keys or partitions?

  • map
  • mapValues
  • map or mapValues
  • You cannot map a pair RDD without affecting the keys or partitions.

Question: How can you view the lineage of an RDD?

  • showLineage()
  • toDebugString()
  • printHistory()
  • printGraph()
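For reference, a short sketch of inspecting lineage follows. It assumes a local Spark context, and `LineageSketch` is a hypothetical name. Note that in the Scala API `toDebugString` is declared without parentheses, so it is called that way here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageSketch {
  // A small pipeline with one shuffle, so the lineage shows a stage boundary.
  def wordCounts(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage").setMaster("local[2]"))
    try {
      val counts = wordCounts(sc.parallelize(Seq("a b a", "b c")))
      // Prints the chain of parent RDDs; indentation marks shuffle boundaries.
      println(counts.toDebugString)
    } finally sc.stop()
  }
}
```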

Question: Adding a key to an RDD will automatically repartition it so that the keys are co-located. True or false?

  • True
  • False

Question: How can you reference an external class in a closure without serializing it?

  • define it as transient
  • define it as lazy
  • Both of the above
  • None of the above
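The closure question above refers to a common Scala idiom, sketched here under stated assumptions: `Shouter` is a made-up helper that is deliberately not serializable, and `ShoutJob` is a hypothetical class whose instance gets captured by the closure.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TransientLazySketch {
  // Hypothetical helper that is deliberately NOT Serializable.
  class Shouter {
    def shout(s: String): String = s.toUpperCase + "!"
  }

  // The job instance is serialized along with the closure, but the helper field
  // is skipped (@transient) and rebuilt on first use where the task runs (lazy).
  class ShoutJob extends Serializable {
    @transient lazy val shouter: Shouter = new Shouter
    def run(words: RDD[String]): RDD[String] = words.map(w => shouter.shout(w))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("transient-lazy").setMaster("local[2]"))
    try {
      new ShoutJob().run(sc.parallelize(Seq("a", "b"))).collect().foreach(println)
    } finally sc.stop()
  }
}
```

Without `@transient`, Spark would try to serialize the `Shouter` field together with the job instance and fail with a not-serializable error.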

Question: What does Spark do during speculative execution?

  • Spark looks for tasks it expects to be short and runs them first
  • Spark dynamically allocates more resources to large tasks
  • Spark identifies slow-running tasks and restarts them
  • None of the above

Question: What does the following code do?

val text = sc.textFile("SomeText.txt")
val counts = text.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collectAsMap()

  • Counts the total number of words in the document
  • Counts the number of distinct words in the document
  • Maps every word in the document to the number of times it occurs
  • None of the above

Question: Which operation has the highest chance of causing out-of-memory errors if the dataset is really large?

  • countByValue
  • groupByKey
  • reduceByKey
  • map

Question: What is the result of this code?

val pairs = sc.parallelize(List(("a", 1), ("a", 5), ("b", 6), ("b", 3), ("c", 2)))

val results = pairs.reduceByKey((a, b) => {
  a > b match {
    case true => a
    case false => b
  }
}).collectAsMap()

  • ("a" -> 5, "b" -> 6, "c" -> 2)
  • ("a" -> 6, "b" -> 9, "c" -> 2)
  • (5, 6, 2)
  • None of the above.
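The reduce function in the snippet above keeps the larger of two values, so it can be written more compactly. The following self-contained sketch assumes a local Spark context; `MaxByKeySketch` and `maxByKey` are hypothetical names, and the input mirrors the quiz data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MaxByKeySketch {
  // Keep the largest value seen for each key.
  def maxByKey(pairs: RDD[(String, Int)]): scala.collection.Map[String, Int] =
    pairs.reduceByKey((a, b) => math.max(a, b)).collectAsMap()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("max-by-key").setMaster("local[2]"))
    try {
      val pairs = sc.parallelize(List(("a", 1), ("a", 5), ("b", 6), ("b", 3), ("c", 2)))
      println(maxByKey(pairs))  // holds a -> 5, b -> 6, c -> 2 (print order may vary)
    } finally sc.stop()
  }
}
```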

Question: You can execute asynchronous actions with the default FIFO scheduler. True or false?

  • True
  • False

Question: Which of the following statements about broadcast variables is true?

  • They are read-only
  • They can eliminate shuffles
  • They are shared between workers via the peer-to-peer protocol
  • All of the above
  • None of the above
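One way a broadcast variable can stand in for a shuffle is a map-side join against a small lookup table. The sketch below is an illustration under assumptions: `BroadcastJoinSketch`, `broadcastJoin`, and the sample data are hypothetical, and the approach only makes sense when the lookup table comfortably fits in memory on every executor.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BroadcastJoinSketch {
  // Ship the small table to every executor once, then join inside a narrow
  // transformation so the large RDD is never shuffled.
  def broadcastJoin(big: RDD[(String, Int)],
                    small: Map[String, String]): RDD[(String, (Int, String))] = {
    val lookup = big.sparkContext.broadcast(small)  // read-only on the executors
    big.flatMap { case (k, v) =>
      lookup.value.get(k).map(name => (k, (v, name))).toList  // keys without a match are dropped
    }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-join").setMaster("local[2]"))
    try {
      val big = sc.parallelize(Seq(("a", 1), ("b", 2), ("z", 9)))
      broadcastJoin(big, Map("a" -> "alpha", "b" -> "beta")).collect().foreach(println)
    } finally sc.stop()
  }
}
```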

Question: With the MEMORY_ONLY storage level, what happens when an RDD can’t fit in memory?

  • Spark will automatically change the storage level to MEMORY_AND_DISK
  • Some of the partitions will not be cached
  • Some of the partitions will be spilled to disk
  • Spark will throw an OOM error
  • None of the above

Question: How can you reduce the amount of memory used by persisted RDDs?

  • Use primitive types instead of Java or Scala collections and nested classes
  • Enable compression
  • Use Kryo serialization instead of Java
  • All of the above
  • None of the above
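The memory-saving options listed above map onto a few configuration settings. The sketch below shows one possible combination; the property names are standard Spark settings, while `KryoConfSketch`, `kryoConf`, and the `Reading` record type are hypothetical and exist only for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object KryoConfSketch {
  // Hypothetical record type built from primitive fields only.
  case class Reading(sensorId: Int, value: Double)

  def kryoConf(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      .setMaster("local[2]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.rdd.compress", "true")             // compress serialized RDD partitions
      .registerKryoClasses(Array(classOf[Reading]))  // avoids writing full class names per record

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(kryoConf("kryo-sketch"))
    try {
      val readings = sc.parallelize(Seq(Reading(1, 0.5), Reading(2, 1.5)))
        .persist(StorageLevel.MEMORY_ONLY_SER)       // store partitions in serialized form
      println(readings.count())
    } finally sc.stop()
  }
}
```

Serialized, compressed storage trades memory for extra CPU work each time the partitions are read.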

Question: Which point in an RDD lineage is the best to persist?

  • Before a reduceByKey operation
  • After outputting to disk
  • After a lot of transformations for downstream computations, such as filtering or joining
  • At the root RDD
  • None of the above

Question: A pool can have its own scheduler. True or false?

  • True
  • False

Question: In the event of a failure, how can Spark recover a lost partition?

  • Find the last good state in the RDD lineage and recompute the lost partition.
  • Restart from the root RDD.
  • Find the last good state in the RDD lineage and recompute every task.
  • Spark’s fail-safes ensure that failures will never occur.
  • None of the above.

Question: Which of the following IDEs fully supports SBT?

  • Eclipse
  • IntelliJ
  • Both Eclipse and IntelliJ
  • None of the above

