Monday, June 6, 2016

Introduction To R Language

INTRODUCTION TO R

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and Mac OS.

The R environment

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes

  • an effective data handling and storage facility,
  • a suite of operators for calculations on arrays, in particular matrices,
  • a large, coherent, integrated collection of intermediate tools for data analysis,
  • graphical facilities for data analysis and display either on-screen or on hard copy, and
  • a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.
R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.
Many users think of R as a statistics system. We prefer to think of it as an environment within which statistical techniques are implemented. R can be extended (easily) via packages. There are about eight packages supplied with the R distribution and many more are available through the CRAN family of Internet sites covering a very wide range of modern statistics.
R has its own LaTeX-like documentation format, which is used to supply comprehensive documentation, both on-line in a number of formats and in hard copy.


Every data analysis technique at your fingertips: R includes virtually every data manipulation, statistical model, and chart that the modern data scientist could ever need. You can easily find, download and use cutting-edge community-reviewed methods in statistics and predictive modeling from leading researchers in data science, free of charge.


Create beautiful and unique data visualizations: Representing complex data with charts and graphs is an essential part of the data analysis process, and R goes far beyond the traditional bar chart and line plot. Heavily influenced by thought leaders in data visualization like Bill Cleveland and Edward Tufte, R makes it easy to draw meaning from multidimensional data with multi-panel charts, 3-D surfaces and more. The custom charting capabilities of R are featured in many of the stunning infographics seen in the New York Times, The Economist, and the Flowing Data blog.

Get better results faster: Instead of using point-and-click menus or inflexible "black-box" procedures, R is a programming language designed expressly for data analysis. Intermediate-level R programmers create data analyses faster than users of legacy statistical software, with the flexibility to mix and match models for the best results. And R scripts are easily automated, promoting both reproducible research and production deployments.

Draw on the talents of data scientists worldwide: As a thriving open-source project, R is supported by a community of more than 2 million users and thousands of developers worldwide. Whether you're using R to optimize portfolios, analyze genomic sequences, or predict component failure times, experts in every domain have made resources, applications and code available for free, online.


Monday, May 30, 2016

Top IOT Platforms

Top IOT Platforms



What is an IoT Platform?

In simple words, the purpose of any IoT device is to connect with other IoT devices and applications (mostly cloud-based) to relay information using internet transfer protocols.
The gap between the device sensors and the data networks is filled by an IoT platform. Such a platform connects the data network to the sensor arrangement and provides insights using backend applications to make sense of the plethora of data generated by hundreds of sensors.
In light of the possibilities the internet of things offers, tech companies have started capitalizing on it. There are many IoT platforms available now that let you deploy internet of things applications on the go.
While there are hundreds of companies and a few startups venturing into IoT platform development, players like Amazon and Microsoft are way ahead of the competition. Read on to learn about the top IoT platforms you can use for your applications.


1) Amazon Web Services (AWS) IOT 

Last year Amazon announced the AWS IoT platform at its re:Invent conference. The main features of the AWS IoT platform are:
  • Registry for recognizing devices
  • Software Development Kit for devices
  • Device Shadows
  • Secure Device Gateway
  • Rules engine for inbound message evaluation
According to Amazon, their IoT platform will make it a lot easier for developers to connect sensors for multiple applications ranging from automobiles to turbines to smart home light bulbs.
Taking the scope of AWS IoT to the next level, the vendor has partnered with hardware manufacturers like Intel, Texas Instruments, Broadcom and Qualcomm to create starter kits compatible with their platform.


2) Microsoft Azure IOT

Microsoft is very much interested in building products for the internet of things. For this initiative, it offers the Azure IoT Suite, an IoT platform compatible with Microsoft Azure cloud services. Features included in this platform are:

  • Device shadowing
  • A rules engine
  • Identity registry
  • Information monitoring
To process the massive amounts of information generated by sensors, the Azure IoT Suite comes with Azure Stream Analytics, which processes the data in real time.

3) ThingWorx IOT Platform

In the vendor’s own words, “ThingWorx is the industry’s leading Internet of Things (IoT) technology platform. It enables innovators to rapidly create and deploy game-changing applications, solutions and experiences for today’s smart, connected world.”
ThingWorx is an IoT platform designed for enterprise application development. It offers features like:

  • Easy connectivity of devices to the platform
  • Reduced complexity in IoT application development
  • Platform sharing among developers for rapid development
  • Integrated machine learning for automating complex big data analytics
  • Deployment of cloud, embedded or on-premise IoT solutions on the go

4) IBM Watson



We could never expect Big Blue to miss the opportunity to make its mark in the internet of things segment. IBM Watson is an IoT platform that has already gained considerable traction among developers. Backed by Bluemix, IBM’s hybrid cloud PaaS (platform as a service) development platform, Watson IoT enables developers to easily deploy IoT applications.
Users of IBM Watson get:

  • Device Management
  • Secure Communications
  • Real Time Data Exchange
  • Data Storage
  • Recently added sensor data and weather data services.

5) Cisco IOT Cloud



Cisco’s IoT platform is aimed at mobile operators, offering:

  • Voice and data connectivity
  • SIM lifecycle management
  • IP session control
  • Customizable billing and reporting
Cisco has also recently partnered with the National Farmers’ Federation in Australia to provide internet of things solutions for agriculture.


6) Salesforce IOT Cloud

Salesforce’s IoT platform is powered by Thunder and is focused entirely on customer engagement. According to the official announcement, the platform offers the following features:
  • Sales: create sales orders and capture potential opportunities
  • Service: request and order repairs automatically
  • Marketing: notify customers through texts directly on their devices
  • Apps: automate inspections of inventory

7) Carriots Cloud


Carriots is a PaaS platform that helps build and host internet of things applications. It is slowly gaining popularity among users due to its ease of integration with other applications. It is also widely used for machine-to-machine (M2M) development.
It offers the following functionality:
  • Device management
  • SDK application engine
  • Debug and logs
  • API key management
  • Data export feature
  • Custom alarms
  • Customer hierarchy levels
  • User management
  • Custom control panel


8) Oracle Integrated Cloud

Offering big data analysis for real-time IoT data, device virtualization, endpoint management and high-speed messaging, Oracle’s Integrated Cloud is an analytics platform for IoT applications. Users can receive notifications directly on their devices.

According to the vendor, their IoT platform is:
  • Faster to market
  • Integrated
  • Real-time insight
  • Secure and scalable



Tuesday, May 24, 2016

Hadoop Interview Questions and Answers

Top Hadoop Interview Questions



1.What are real-time industry applications of Hadoop?
Hadoop, well known as Apache Hadoop, is an open-source software platform for scalable and distributed computing of large volumes of data. It provides rapid, high performance and cost-effective analysis of structured and unstructured data generated on digital platforms and within the enterprise. It is used in almost all departments and sectors today:
• Managing traffic on streets
• Streaming processing
• Content Management and Archiving Emails
• Processing Rat Brain Neuronal Signals using a Hadoop Computing Cluster
• Fraud detection and Prevention
• Advertisements Targeting Platforms are using Hadoop to capture and analyze click stream, transaction, video and social media data
• Managing content, posts, images and videos on social media platforms
• Analyzing customer data in real-time for improving business performance
• Public sector fields such as intelligence, defense, cyber security and scientific research
• Financial agencies are using Big Data Hadoop to reduce risk, analyze fraud patterns, identify rogue traders, more precisely target their marketing campaigns based on customer segmentation, and improve customer satisfaction
• Getting access to unstructured data like output from medical devices, doctor’s notes, lab results, imaging reports, medical correspondence, clinical data, and financial data
2.How is Hadoop different from other parallel computing systems?
Hadoop provides a distributed file system, which lets you store and handle massive amounts of data on a cluster of machines while handling data redundancy. The primary benefit is that since data is stored on several nodes, it is better to process it in a distributed manner: each node can process the data stored on it instead of spending time moving it over the network.
In contrast, in a relational database computing system, you can query data in real time, but it is not efficient to store data in tables, records and columns when the data is huge.
Hadoop also provides a scheme to build a column-oriented database with Hadoop HBase, for run-time queries on rows.
3.What modes can Hadoop be run in?
Hadoop can run in three modes:
1. Standalone Mode: The default mode of Hadoop, it uses the local file system for input and output operations. This mode is mainly used for debugging, and it does not support the use of HDFS. Further, in this mode no custom configuration is required for the mapred-site.xml, core-site.xml and hdfs-site.xml files. It is much faster than the other modes.
2. Pseudo-Distributed Mode (Single-Node Cluster): In this mode, you need to configure all three files mentioned above. All daemons run on one node, so the Master and Slave nodes are the same.
3. Fully Distributed Mode (Multi-Node Cluster): This is the production phase of Hadoop (what Hadoop is known for), where data is used and distributed across several nodes of a Hadoop cluster. Separate nodes are allotted as Master and Slaves.
4.Explain the major difference between HDFS block and InputSplit.
In simple terms, a block is the physical representation of data, while a split is the logical representation of the data present in a block. A split acts as an intermediary between the block and the mapper. Suppose the record “Datanalytix” is stored across two blocks:
Block 1: Datana
Block 2: lytix

Now, a map reading Block 1 sees only “Datana”, but it does not know how to process the rest of the record sitting in the second block at the same time. This is where the split comes into play: it forms a logical group of Block 1 and Block 2, treated as a single block. The InputFormat and RecordReader then form key-value pairs from the split and send them to the map for further processing. With InputSplit, if you have limited resources, you can increase the split size to limit the number of maps. For instance, if there are 10 blocks of 64 MB each (640 MB in total) and resources are limited, you can set the split size to 128 MB. This forms logical groups of 128 MB, with only 5 maps executing at a time.
However, if a file cannot be split (or splitting is disabled), the whole file forms one InputSplit and is processed by a single map, which consumes more time when the file is bigger.
5.What is distributed cache and what are its benefits?
Distributed Cache, in Hadoop, is a service provided by the MapReduce framework to cache files when needed. Once a file is cached for a specific job, Hadoop makes it available on each data node, both on the file system and in memory, where map and reduce tasks are executing. Later, you can easily access and read the cache file and populate any collection (like an array or hashmap) in your code.
Benefits of using distributed cache are:
• It distributes simple, read-only text/data files and/or complex types like jars, archives and others. These archives are then un-archived at the slave node.
• Distributed Cache tracks the modification timestamps of the cache files, ensuring the files are not modified while a job is currently executing.
6.Explain the difference between NameNode, Checkpoint NameNode and BackupNode.
NameNode is the core of HDFS that manages the metadata – the information of what file maps to what block locations and what blocks are stored on what datanode. In simple terms, it’s the data about the data being stored. NameNode supports a directory tree-like structure consisting of all the files present in HDFS on a Hadoop cluster. It uses following files for namespace:
fsimage file- It keeps track of the latest checkpoint of the namespace.
edits file-It is a log of changes that have been made to the namespace since checkpoint.
Checkpoint NameNode has the same directory structure as NameNode, and creates checkpoints for the namespace at regular intervals by downloading the fsimage and edits files and merging them within the local directory. The new image after merging is then uploaded to NameNode.
There is a similar node like Checkpoint, commonly known as Secondary Node, but it does not support the ‘upload to NameNode’ functionality.
Backup Node provides similar functionality as Checkpoint, enforcing synchronization with NameNode. It maintains an up-to-date in-memory copy of file system namespace and doesn’t require getting hold of changes after regular intervals. The backup node needs to save the current state in-memory to an image file to create a new checkpoint.
7.What are the most common Input Formats in Hadoop?
There are three most common input formats in Hadoop:
• Text Input Format: Default input format in Hadoop.
• Key Value Input Format: used for plain text files where the files are broken into lines
• Sequence File Input Format: used for reading files in sequence
8.Define DataNode and how does NameNode tackle DataNode failures?
DataNode stores data in HDFS; it is the node where the actual data resides in the file system. Each DataNode sends a heartbeat message to notify that it is alive. If the NameNode does not receive a heartbeat from a DataNode for 10 minutes, it considers that DataNode to be dead or out of service, and starts replicating the blocks that were hosted on it so that they are hosted on some other DataNode. A BlockReport contains the list of all blocks on a DataNode; using it, the system starts to replicate the blocks that were stored on the dead DataNode.
The NameNode manages the replication of data blocks from one DataNode to another. In this process, the replicated data transfers directly between DataNodes, so the data never passes through the NameNode.
9.What are the core methods of a Reducer?
The three core methods of a Reducer are:
1. setup(): this method is used for configuring various parameters like input data size and the distributed cache.
protected void setup(Context context)
2. reduce(): the heart of the reducer, called once per key with the associated list of values.
protected void reduce(Key key, Iterable<Value> values, Context context)
3. cleanup(): this method is called only once, at the end of the task, to clean up temporary files (a combined Scala sketch follows below).
protected void cleanup(Context context)
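As a rough, self-contained sketch of how the three methods fit together, written here in Scala against the new MapReduce API (the class name SumReducer and the word-count-style key/value types are illustrative assumptions, not part of the original answer):

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Reducer

// Illustrative reducer that sums the integer values seen for each key.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  private val result = new IntWritable()

  // setup(): one-time configuration, e.g. reading job parameters or the distributed cache
  override def setup(context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = ()

  // reduce(): called once per key with all of the values associated with that key
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    result.set(sum)
    context.write(key, result)
  }

  // cleanup(): called once at the end of the task, e.g. to delete temporary files
  override def cleanup(context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = ()
}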
10.What is SequenceFile in Hadoop?
Extensively used in MapReduce I/O formats, SequenceFile is a flat file containing binary key/value pairs. The map outputs are stored as SequenceFile internally. It provides Reader, Writer and Sorter classes. The three SequenceFile formats are:
1. Uncompressed key/value records.
2. Record compressed key/value records – only ‘values’ are compressed here.
3. Block compressed key/value records – both keys and values are collected in ‘blocks’ separately and compressed. The size of the ‘block’ is configurable.
11.What is Job Tracker role in Hadoop?
Job Tracker’s primary functions are resource management (managing the Task Trackers), tracking resource availability, and task life-cycle management (tracking task progress and fault tolerance).
• It is a process that runs on a separate node, often not on a DataNode
• Job Tracker communicates with the NameNode to identify data location
• Finds the best Task Tracker Nodes to execute tasks on given nodes
• Monitors individual Task Trackers and submits the overall job back to the client.
• It tracks the execution of MapReduce workloads local to the slave node.
12.What is the use of RecordReader in Hadoop?
Since Hadoop splits data into various blocks, RecordReader is used to read the split data into a single record. For instance, if our input data is split like:
Row1: Welcome to
Row2: Datanalytix Blog
It will be read as “Welcome to Datanalytix Blog” using RecordReader.
13.What is Speculative Execution in Hadoop?
One limitation of Hadoop is that, by distributing the tasks on several nodes, there are chances that a few slow nodes limit the rest of the program. There are various reasons for tasks to be slow, and they are sometimes not easy to detect. Instead of identifying and fixing the slow-running tasks, Hadoop tries to detect when a task runs slower than expected and then launches an equivalent task as a backup. This backup mechanism in Hadoop is Speculative Execution.
It creates a duplicate task on another disk, so the same input can be processed multiple times in parallel. When most tasks in a job come to completion, the speculative execution mechanism schedules duplicate copies of the remaining (slower) tasks across the nodes that are currently free. When these tasks finish, the JobTracker is notified. If other copies are still executing speculatively, Hadoop tells the TaskTrackers to quit those tasks and reject their output.
Speculative execution is enabled by default in Hadoop. To disable it, set the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, as sketched below.
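For illustration, a minimal sketch in Scala assuming the old mapred API's JobConf mentioned above:

import org.apache.hadoop.mapred.JobConf

// Turn speculative execution off for both map and reduce tasks.
val conf = new JobConf()
conf.setBoolean("mapred.map.tasks.speculative.execution", false)
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false)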
14.What happens if you try to run a Hadoop job with an output directory that is already present?
It will throw an exception saying that the output file directory already exists. To run the MapReduce job, you need to ensure that the output directory does not exist before in the HDFS.
To delete the directory before running the job, you can use shell:
hadoop fs -rmr /path/to/your/output/
Or via the Java API: FileSystem.get(conf).delete(outputDir, true);
15.How can you debug Hadoop code?
First, check the list of MapReduce jobs currently running. Next, we need to see that there are no orphaned jobs running; if yes, you need to determine the location of RM logs.
1. Run: “ps -ef | grep -i ResourceManager”
and look for log directory in the displayed result. Find out the job-id from the displayed list and check if there is any error message associated with that job.
2. On the basis of RM logs, identify the worker node that was involved in execution of the task.
3. Now, log in to that node and run: “ps -ef | grep -i NodeManager”
4. Examine the Node Manager log. The majority of errors come from user level logs for each map-reduce job.
16.How to configure Replication Factor in HDFS?
hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml will change the default replication for all files placed in HDFS.
You can also modify the replication factor on a per-file basis using the Hadoop FS Shell:
[training@localhost ~]$ hadoop fs -setrep -w 3 /my/file
Alternatively, you can change the replication factor of all the files under a directory:
[training@localhost ~]$ hadoop fs -setrep -R -w 3 /my/dir
17.How to compress mapper output but not the reducer output?
To achieve this compression, you should set:
conf.set("mapreduce.map.output.compress", "true");
conf.set("mapreduce.output.fileoutputformat.compress", "false");
18.What is the difference between Map Side join and Reduce Side Join?
A map-side join is performed as the data is read by the map, before it reaches the reduce phase, and it requires a strict structure (the input datasets must be sorted and identically partitioned). A reduce-side join (repartitioned join), on the other hand, is simpler than a map-side join since the input datasets need not be structured. However, it is less efficient, as every record has to go through the sort and shuffle phases, which brings network overhead.
19.How can you transfer data from Hive to HDFS?
By writing the query: hive> insert overwrite directory '/user/hive/output' select * from emp;
You can write your query for the data you want to import from Hive to HDFS. The output you receive will be stored in part files in the specified HDFS path.
20.What companies use Hadoop, any idea?
Yahoo! (the biggest contributor to the creation of Hadoop) – the Yahoo search engine uses Hadoop; Facebook – developed Hive for analysis; and Amazon, Netflix, Adobe, eBay, Spotify and Twitter, among others.



Monday, May 23, 2016

Spark Interview Questions

Interview Questions For Spark






1.What is Apache Spark?
Spark is a fast, easy-to-use and flexible data processing framework. It has an advanced execution engine supporting cyclic data flow and in-memory computing. Spark can run on Hadoop, standalone or in the cloud, and is capable of accessing diverse data sources including HDFS, HBase, Cassandra and others.
2.Explain key features of Spark.
• Allows integration with Hadoop and files included in HDFS.
• Spark has an interactive language shell, since it ships with an independent Scala (the language in which Spark is written) interpreter.
• Spark consists of RDDs (Resilient Distributed Datasets), which can be cached across computing nodes in a cluster.
• Spark supports multiple analytic tools that are used for interactive query analysis, real-time analysis and graph processing.
3.Define RDD.
RDD is the acronym for Resilient Distributed Datasets – a fault-tolerant collection of operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed. There are primarily two types of RDD:
• Parallelized collections: existing RDDs running in parallel with one another
• Hadoop datasets: they perform a function on each file record in HDFS or another storage system
4.What does a Spark Engine do?
Spark Engine is responsible for scheduling, distributing and monitoring the data application across the cluster.
5.Define Partitions.
As the name suggests, a partition is a smaller and logical division of data, similar to a ‘split’ in MapReduce. Partitioning is the process of deriving logical units of data to speed up processing. Everything in Spark is a partitioned RDD.
6.What operations does an RDD support?
• Transformations
• Actions
7.What do you understand by Transformations in Spark?
Transformations are functions applied on an RDD, resulting in another RDD. A transformation does not execute until an action occurs. map() and filter() are examples of transformations: the former applies the function passed to it to each element of the RDD and results in another RDD, while filter() creates a new RDD by selecting elements from the current RDD that pass the function argument.
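A small Scala sketch of both transformations, assuming an existing SparkContext named sc:

// Transformations are lazy: nothing is computed until an action is called.
val nums    = sc.parallelize(Seq(1, 2, 3, 4, 5))
val squares = nums.map(x => x * x)        // map(): apply a function to every element
val evens   = nums.filter(_ % 2 == 0)     // filter(): keep only elements that pass the predicate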
8. Define Actions.
An action brings the data back from an RDD to the local machine. An action’s execution is the result of all previously created transformations. reduce() is an action that applies the function passed to it again and again until only one value is left. take(n) is an action that brings the first n elements of the RDD to the local node.
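Continuing the sketch above, the following actions trigger the actual computation and return results to the driver:

val total      = squares.reduce(_ + _)    // reduce(): combine elements until one value is left
val firstEvens = evens.take(2)            // take(n): bring the first n elements to the local node
println(s"total = $total, firstEvens = ${firstEvens.mkString(", ")}")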
9.Define functions of SparkCore.
Serving as the base engine, SparkCore performs various important functions like memory management, monitoring jobs, fault-tolerance, job scheduling and interaction with storage systems.
10.What is RDD Lineage?
Spark does not support data replication in memory, so if any data is lost it is rebuilt using RDD lineage. RDD lineage is a process that reconstructs lost data partitions. The best part is that an RDD always remembers how it was built from other datasets.
11.What is Spark Driver?
Spark Driver is the program that runs on the master node of the machine and declares transformations and actions on data RDDs. In simple terms, driver in Spark creates SparkContext, connected to a given Spark Master.
The driver also delivers the RDD graphs to Master, where the standalone cluster manager runs.
12.What is Hive on Spark?
Hive contains significant support for Apache Spark, wherein Hive execution is configured to Spark:
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on yarn mode by default.
13.Name commonly-used Spark Ecosystems.
• Spark SQL (Shark)- for developers
• Spark Streaming for processing live data streams
• GraphX for generating and computing graphs
• MLlib (Machine Learning Algorithms)
• SparkR to promote R Programming in Spark engine.
14.Define Spark Streaming.
Spark Streaming is an extension of the Spark API that allows stream processing of live data streams. Data from different sources like Flume or HDFS is streamed and finally processed and written to file systems, live dashboards and databases. It is similar to batch processing in that the input data is divided into small batches (micro-batches).
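A minimal word-count sketch of Spark Streaming; the socket source, host/port and 10-second batch interval are assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))        // 10-second micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999)      // live input stream
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()                                            // output each batch to the console

    ssc.start()
    ssc.awaitTermination()
  }
}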
15.What is GraphX?
Spark uses GraphX for graph processing to build and transform interactive graphs. The GraphX component enables programmers to reason about structured data at scale.
16.What does MLlib do?
MLlib is a scalable machine learning library provided by Spark. It aims at making machine learning easy and scalable, with common learning algorithms and use cases like clustering, regression, collaborative filtering and dimensionality reduction.
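As one example, a short clustering sketch using MLlib's KMeans (the input path, k = 2 and 20 iterations are assumptions; sc is an existing SparkContext):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Each line of the assumed input file holds space-separated numeric features.
val data   = sc.textFile("hdfs:///user/data/kmeans_data.txt")
val points = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()

val model = KMeans.train(points, 2, 20)   // k = 2 clusters, 20 iterations
model.clusterCenters.foreach(println)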
17.What is Spark SQL?
Spark SQL, earlier known as Shark, is a module introduced in Spark for working with structured data and performing structured data processing. Through this module, Spark executes relational SQL queries on the data. The core of the component supports a different kind of RDD called SchemaRDD, composed of row objects and schema objects that define the data type of each column in a row. It is similar to a table in a relational database.
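A short sketch of querying structured data through Spark SQL with the 1.x API (the people.json file name is an assumption; sc is an existing SparkContext):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.read.json("people.json")          // infer the schema from JSON records
people.registerTempTable("people")                        // expose the data to SQL
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()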
18.What is a Parquet file?
Parquet is a columnar-format file supported by many other data processing systems. Spark SQL performs both read and write operations on Parquet files and considers it one of the best big data analytics formats so far.
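Continuing the Spark SQL sketch above, reading and writing Parquet takes one line each (the path is illustrative):

people.write.parquet("people.parquet")                         // write the data out in Parquet format
val parquetPeople = sqlContext.read.parquet("people.parquet")  // read it back, schema preserved
parquetPeople.registerTempTable("parquet_people")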
19.What file systems Spark support?
• Hadoop Distributed File System (HDFS)
• Local File system
• S3
20.What is Yarn?
Similar to Hadoop, Yarn is one of the key features in Spark, providing a central resource management platform to deliver scalable operations across the cluster. Running Spark on Yarn requires a binary distribution of Spark that is built with Yarn support.
21.List the functions of Spark SQL.
Spark SQL is capable of:
• Loading data from a variety of structured sources
• Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau
• Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more
22.What are benefits of Spark over MapReduce?
• Due to in-memory processing, Spark runs workloads around 10-100x faster than Hadoop MapReduce, which uses persistent storage for all of its data processing tasks.
• Unlike Hadoop, Spark provides built-in libraries to perform multiple tasks from the same core, like batch processing, streaming, machine learning and interactive SQL queries. Hadoop, in contrast, only supports batch processing.
• Hadoop is highly disk-dependent whereas Spark promotes caching and in-memory data storage.
• Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation, while Hadoop implements no iterative computing.
23.Is there any benefit of learning MapReduce, then?
Yes, MapReduce is a paradigm used by many big data tools including Spark as well. It is extremely relevant to use MapReduce when the data grows bigger and bigger. Most tools like Pig and Hive convert their queries into MapReduce phases to optimize them better.
24.What is Spark Executor?
When SparkContext connects to a cluster manager, it acquires executors on nodes in the cluster. Executors are Spark processes that run computations and store the data on the worker nodes. The final tasks from SparkContext are transferred to executors for execution.
25.Name types of Cluster Managers in Spark.
The Spark framework supports three major types of Cluster Managers:
• Standalone: a basic manager to set up a cluster
• Apache Mesos: generalized/commonly-used cluster manager, also runs Hadoop MapReduce and other applications
• Yarn: responsible for resource management in Hadoop
26.What do you understand by worker node?
Worker node refers to any node that can run the application code in a cluster.
27.What is PageRank?
A unique feature and algorithm in GraphX, PageRank measures the importance of each vertex in a graph. For instance, an edge from u to v represents an endorsement of v’s importance by u. In simple terms, if a user on Instagram is followed massively, that user will rank high on the platform.
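A brief GraphX sketch of PageRank (the edge-list file followers.txt is an assumption; each line holds a source and destination vertex id, and sc is an existing SparkContext):

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "followers.txt")       // build a graph from an edge list
val ranks = graph.pageRank(0.0001).vertices                     // run PageRank to tolerance 0.0001
ranks.sortBy(_._2, ascending = false).take(5).foreach(println)  // five highest-ranked vertices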
28.Do you need to install Spark on all nodes of Yarn cluster while running Spark on Yarn?
No, because Spark runs on top of Yarn.
29.Illustrate some demerits of using Spark.
Since Spark utilizes more storage space compared to Hadoop and MapReduce, there may arise certain problems. Developers need to be careful while running their applications in Spark. Instead of running everything on a single node, the work must be distributed over multiple clusters. 
30.How to create RDD?
Spark provides two methods to create RDD:
• By parallelizing a collection in your Driver program. This makes use of SparkContext’s ‘parallelize’ method
val data = Array(2, 4, 6, 8, 10)
val distData = sc.parallelize(data)
• By loading an external dataset from external storage like HDFS, HBase or a shared file system (a one-line sketch follows below).
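A minimal sketch of the second method (the HDFS path is an assumption for illustration):

val lines = sc.textFile("hdfs:///user/data/input.txt")   // each line of the file becomes one element of the RDD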