I started using mrjob only a few days ago and have tried a few small and medium-sized tasks. Now I am stuck on passing a Common Crawl (CC from here on) location as input to EMR using Python and mrjob.
My config file looks like this:

runners:
  emr:
    aws_access_key_id: <AWS Access Key>
    aws_secret_access_key: <AWS Secret Access Key>
    aws_region: us-east-1
    ec2_key_pair: cslab
    ec2_key_pair_file: ~/cslab.pem
    ec2_instance_type: m1.small
    num_ec2_instances: 5
  local:
    base_tmp_dir: /tmp
Long story short: I am trying to get the number of words on each web page of a site.

My code:
import boto
import warc

from boto.s3.key import Key
from mrjob.job import MRJob


class MRcount(MRJob):
    # ...
    def mapper(self, _, s3_path):
        # each input line from wet.paths.gz is a key path inside the
        # aws-publicdatasets bucket (e.g. common-crawl/crawl-data/.../*.warc.wet.gz)
        conn = boto.connect_s3()
        bucket = conn.get_bucket('aws-publicdatasets', validate=False)
        key = Key(bucket, s3_path)
        # WET files are gzip-compressed WARC files, so iterate over their records
        for record in warc.WARCFile(fileobj=key, compress=True):
            url = record.header.get('warc-target-uri')
            if url is None:
                continue  # skip the warcinfo record, which has no target URI
            webpage_text = record.payload.read()
            yield url, len(webpage_text.split())


if __name__ == '__main__':
    MRcount.run()
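For reference, this is roughly how I would test the same logic locally against a single WET file, outside of EMR (a minimal sketch, assuming boto 2.x credentials are configured on my machine and the warc package is installed):

import gzip
import boto
import warc
from StringIO import StringIO
from boto.s3.key import Key

conn = boto.connect_s3()
bucket = conn.get_bucket('aws-publicdatasets', validate=False)

# grab the first key path listed in wet.paths.gz
paths_key = bucket.get_key('common-crawl/crawl-data/CC-MAIN-2014-52/wet.paths.gz')
paths = gzip.GzipFile(fileobj=StringIO(paths_key.get_contents_as_string()))
first_path = paths.readline().strip()

# count the words in the first real record of that WET file
wet_key = Key(bucket, first_path)
for record in warc.WARCFile(fileobj=wet_key, compress=True):
    uri = record.header.get('warc-target-uri')
    if uri is None:
        continue  # skip the leading warcinfo record
    print uri, len(record.payload.read().split())
    break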
Everything looks fine up to this point, but this is what happens when I try to run it.
Cmd:
$ python mr_crawl.py -r emr s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/wet.paths.gz
Error:
boto.exception.S3ResponseError: S3ResponseError: 301 Moved Permanently
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><RequestId>06660583263444FC</RequestId><Bucket>smarkets-db</Bucket><HostId>TCZJTKZ8wo8V1h0xjkOI6grojs/r9IBkhMOcvolXv06QEtxTX89M55aLTPGOo/ht</HostId><Endpoint>eu-west-bucket.s3.amazonaws.com</Endpoint></Error>
I thought this was because of the region in my config file, so I removed it, but then I got a new error.
My new config file:
runners:
  emr:
    aws_access_key_id: <AWS Access Key>
    aws_secret_access_key: <AWS Secret Access Key>
    ec2_key_pair: cslab
    ec2_key_pair_file: ~/cslab.pem
    ec2_instance_type: m1.small
    num_ec2_instances: 5
  local:
    base_tmp_dir: /tmp
With that config I get the following SSH error instead:
using configs in /etc/mrjob.conf
using existing scratch bucket mrjob-4db6342a70e021ad
using s3://mrjob-4db6342a70e021ad/tmp/ as our scratch dir on S3
creating tmp directory /tmp/word_count.20140603.181541.006786
writing master bootstrap script to /tmp/word_count.20140603.181541.006786/b.py
Copying non-input files into s3://mrjob-4db6342a70e021ad/tmp/word_count.matthew.20140603.181541.006786/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-3DCN7LULSRILW
Created new job flow j-3DCN7LULSRILW
Job on job flow j-3DCN7LULSRILW failed with status FAILED: The given SSH key name was invalid
Logs are in s3://mrjob-4db6342a70e021ad/tmp/logs/j-3DCN7LULSRILW/
Scanning S3 logs for probable cause of failure
Waiting 5.0s for S3 eventual consistency
Terminating job flow: j-3DCN7LULSRILW
Traceback (most recent call last):
File "word_count.py", line 16, in <module>
MRcount.run()
File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 494, in run
mr_job.execute()
File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 512, in execute
super(MRJob, self).execute()
File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 147, in execute
self.run_job()
File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 208, in run_job
runner.run()
File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 458, in run
self._run()
File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 809, in _run
self._wait_for_job_to_complete()
File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 1599, in _wait_for_job_to_complete
raise Exception(msg)
Exception: Job on job flow j-3DCN7LULSRILW failed with status FAILED: The given SSH key name was invalid
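My current guess is that the two errors are related: the redirect points at eu-west, and the cslab key pair presumably only exists in one region, so when I drop aws_region the job flow may launch somewhere that key pair does not exist. If that is right, the config would have to look something like this (purely my assumption that the key pair lives in us-east-1), but then I am back to the original 301 error:

runners:
  emr:
    aws_access_key_id: <AWS Access Key>
    aws_secret_access_key: <AWS Secret Access Key>
    aws_region: us-east-1          # would have to match the region where the "cslab" key pair exists
    ec2_key_pair: cslab
    ec2_key_pair_file: ~/cslab.pem
    ec2_instance_type: m1.small
    num_ec2_instances: 5
  local:
    base_tmp_dir: /tmp

Am I missing something about how the region should be set when reading from the Common Crawl bucket?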
Thanks,