Spark / SPARK-10191

spark-ec2 cannot stop running cluster


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: EC2
    • Labels: None
    • Environment: AWS EC2

    Description

      Using the spark-ec2 command, I've created a cluster named "ruofan-large-cluster" within a virtual private cloud (VPC) on AWS EC2. The cluster contains one master and two slave nodes, and it works very well. Now I would like to stop the cluster for a while and then restart it. However, when I type the following bash command:

      $ ./spark-ec2 --region=us-east-1 stop ruofan-large-cluster
      

      It showed the following output:

      Are you sure you want to stop the cluster ruofan-large-cluster?
      DATA ON EPHEMERAL DISKS WILL BE LOST, BUT THE CLUSTER WILL KEEP USING SPACE ON
      AMAZON EBS IF IT IS EBS-BACKED!!
      All data on spot-instance slaves will be lost.
      Stop cluster ruofan-large-cluster (y/N): y
      Searching for existing cluster ruofan-large-cluster in region us-east-1...
      Stopping master...
      Stopping slaves...
      

      It didn't stop the cluster at all. I'm sure both the cluster name and the cluster region are correct, and I also tried the following command to stop the cluster:

      $ ./spark-ec2 -k <key-file-name> -i <key-file> -r us-east-1 --vpc-id=<my-vpc-id> --subnet-id=<my-subnet-id> stop ruofan-large-cluster
      

      It still showed the same output, and it didn't stop the cluster. I spent several hours on this problem, and I think the official spark-ec2.py script may have a bug in identifying the cluster name, which is why I can't stop clusters. I am using spark-1.4.0, and in most cases spark-ec2.py works very well when I launch clusters on AWS directly, without a VPC subnet. However, if I launch my cluster on a subnet of a VPC, spark-ec2.py is unable to find the cluster, so I can't stop it. Specifically, spark-ec2.py contains the following small segment of code:

      conn = ec2.connect_to_region(opts.region)
      

      Whenever we perform an action such as launch, login, stop, or destroy on the cluster, spark-ec2 first connects to the specified region using the code above, and then fetches all matching instances with reservations = conn.get_all_reservations(filters={some conditions}). This works very well if I launch my cluster directly, without a VPC subnet. If my cluster is in a VPC subnet, then conn.get_all_reservations() returns nothing. I just modified the original code to conn = ec2.connect_to_region(opts.region, aws_access_key_id="my_aws_access_key_id", aws_secret_access_key="my_aws_secret_access_key"), and everything (stop, login, destroy, etc.) works perfectly. I am wondering if we can make some changes to the Spark code.
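
      Below is a minimal sketch of that workaround, assuming boto 2.x (the EC2 library spark-ec2 uses). The credential strings are placeholders, and the filter shown is only illustrative, not necessarily the exact conditions spark_ec2.py builds.

      # Sketch of the workaround (boto 2.x). Credentials and the filter are placeholders.
      import boto.ec2 as ec2

      region = "us-east-1"

      # Original behaviour: rely on boto's default credential resolution.
      # conn = ec2.connect_to_region(region)

      # Workaround: pass the AWS credentials explicitly so that instances launched
      # inside a VPC subnet are also returned by the reservation query below.
      conn = ec2.connect_to_region(
          region,
          aws_access_key_id="my_aws_access_key_id",          # placeholder
          aws_secret_access_key="my_aws_secret_access_key",  # placeholder
      )

      # spark-ec2 locates the cluster by querying reservations with a filter;
      # the filter used here is only an illustrative example.
      reservations = conn.get_all_reservations(
          filters={"instance-state-name": "running"}
      )

      # What the "stop" action ultimately does to the matched instances.
      for reservation in reservations:
          for instance in reservation.instances:
              instance.stop()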

People

    Assignee: Unassigned
    Reporter: Ruofan Kong (krf1088)
    Votes: 0
    Watchers: 2
