Thursday, August 28, 2014

Hadoop Cluster with Virtual Box


Introduction


 

The prerequisite for this tutorial is two VirtualBox virtual machines with Linux/CentOS installed.

Cloudera Manager automates the installation of databases for Cloudera Manager and its services, the packages for Cloudera Manager and the Oracle JDK, and the parcels or packages for CDH and other optional Cloudera products. This option is available if your cluster deployment meets the following requirements (a quick verification sketch follows the list):

  • Uniform SSH access to cluster hosts on the same port from the Cloudera Manager Server host.
  • All hosts must have access to standard package repositories.
  • All hosts must have access to either archive.cloudera.com on the internet or to a local repository with the necessary installation files.
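
A quick way to check these requirements from the Cloudera Manager Server host is a sketch like the following (the hadoop1/hadoop2 hostnames are the ones adopted later in this tutorial):

for h in hadoop1.demo.com hadoop2.demo.com; do
  ssh -p 22 root@$h 'echo OK from $(hostname)'      # uniform SSH access on port 22
done
curl -sI http://archive.cloudera.com/cm4/ | head -1  # expect an HTTP 200/30x status line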

 

Download and Run the Cloudera Manager Installer


 
For installation purposes, the Cloudera Manager Server must have SSH access to the cluster hosts and you must log in using a root account or an account that has password-less sudo permission.

SSH access should be bi-directional.

Cloudera Manager accesses archive.cloudera.com by using yum on Red Hat systems.

To access archive.cloudera.com through a proxy, modify the system configuration on the Cloudera Manager Server host and on every cluster host where you want to install CDH.

On Red Hat systems, add the following property to /etc/yum.conf:

proxy=http://server:port/
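
If the proxy requires authentication, yum supports credentials in the same file (the user name and password below are placeholders):

proxy=http://server:port/
proxy_username=yum-user
proxy_password=password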
 

Preparation


 
VM creation

Create the reference virtual machine, with the following parameters:

·         Bridged network

·         Enough disk space (more than 20 GB)

·         2 to 4 GB of RAM

For a faster OS installation, you can boot the installer with the 'expert text' option.
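
If you prefer the command line to the VirtualBox GUI, a sketch along the following lines creates an equivalent VM (the VM name hadoop1, the host adapter name eth0, and the 25 GB disk size are assumptions; adjust them to your host):

VBoxManage createvm --name hadoop1 --ostype RedHat_64 --register
VBoxManage modifyvm hadoop1 --memory 4096 --nic1 bridged --bridgeadapter1 eth0
VBoxManage createhd --filename hadoop1.vdi --size 25600
VBoxManage storagectl hadoop1 --name SATA --add sata --controller IntelAHCI
VBoxManage storageattach hadoop1 --storagectl SATA --port 0 --device 0 --type hdd --medium hadoop1.vdi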

Network Configuration

Edit the following files to set up the network configuration that will allow all cluster nodes to interact.

[root@hadoop1 ~]# cat /etc/resolv.conf

# Generated by NetworkManager

search demo.com

nameserver 192.168.1.1

[root@hadoop1 ~]#

 

[root@hadoop1 ~]# cat /etc/sysconfig/network

NETWORKING=yes

HOSTNAME=hadoop1.demo.com

[root@hadoop1 ~]#

 

[root@hadoop1 network-scripts]# cat /etc/sysconfig/network-scripts/ifcfg-eth0

TYPE=Ethernet

BOOTPROTO=static

IPADDR=192.168.1.110

PREFIX=21

GATEWAY=192.168.1.1

DNS1=192.168.1.1

DEFROUTE=yes

IPV4_FAILURE_FATAL=yes

IPV6INIT=no

NAME="eth0"

UUID=960b4204-f700-46fc-9a93-f1cf503508d5

ONBOOT=yes

HWADDR=00:0C:29:A5:04:FF

DEVICE=eth0

USERCTL=no

[root@hadoop1 network-scripts]#

 

[root@hadoop1 network-scripts]# cat /etc/selinux/config

 

# This file controls the state of SELinux on the system.

# SELINUX= can take one of these three values:

#     enforcing - SELinux security policy is enforced.

#     permissive - SELinux prints warnings instead of enforcing.

#     disabled - No SELinux policy is loaded.

SELINUX=disabled
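
Changing SELINUX= in this file takes effect only after a reboot; to stop enforcement immediately on the running system, you can also switch to permissive mode:

[root@hadoop1 ~]# setenforce 0   # permissive until reboot; the file above makes the change permanent
[root@hadoop1 ~]# getenforce     # verify the current mode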

 

[root@hadoop1 network-scripts]# cat /etc/yum/pluginconf.d/fastestmirror.conf

[main]

enabled=0

verbose=0

always_print_best_host = true

socket_timeout=3

#  Relative paths are relative to the cachedir (and so works for users as well

# as root).

hostfilepath=timedhosts.txt

maxhostfileage=10

maxthreads=15

#exclude=.gov, facebook

#include_only=.nl,.de,.uk,.ie

[root@hadoop1 network-scripts]#

 

 

Disable the firewall and apply the network configuration by restarting the network service:

[root@hadoop1 network-scripts]# chkconfig iptables off

[root@hadoop1 network-scripts]# /etc/init.d/network restart

Shutting down interface eth0:  Device state: 3 (disconnected)                                                           [  OK  ]

Shutting down loopback interface:                          [  OK  ]

Bringing up loopback interface:                            [  OK  ]

Bringing up interface eth0:  Active connection state: activated

Active connection path: /org/freedesktop/NetworkManager/ActiveConnection/1 [  OK  ]

[root@hadoop1 network-scripts]#

 

Now update all the packages and reboot the virtual machine:

[root@hadoop1 ~]# yum update
[root@hadoop1 ~]# reboot

Setup Network Security

Cluster hosts must have a working network name resolution system. Properly configuring DNS and reverse DNS meets this requirement. If you use /etc/hosts instead of DNS, all hosts files must contain consistent information about host names and addresses across all nodes.
 
For example, /etc/hosts might contain something of the form:
 
127.0.0.1     localhost.localdomain localhost
192.168.1.110 hadoop1.demo.com hadoop1
192.168.1.111 hadoop2.demo.com hadoop2
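
A quick check that name resolution is consistent on every node (names and addresses from the example above):

[root@hadoop1 ~]# getent hosts hadoop1.demo.com hadoop2.demo.com   # forward lookup
[root@hadoop1 ~]# ping -c 1 hadoop2.demo.com                       # reachability by name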




The Cloudera Manager Server must have SSH access to the cluster hosts when you run the installation or upgrade wizard.

You must log in using a root account or an account that has password-less sudo permission. For authentication during the installation and upgrade procedures, you will need to either enter the password or upload a public and private key pair for the root or sudo user account.

 

Note:

 

·         The Cloudera Manager Agent runs as root so that it can make sure the required directories are created and that processes and files are owned by the appropriate user (for example, the hdfs and mapred users).

·         No blocking by Security-Enhanced Linux (SELinux).

·         Disable IPv6 on all machines.

·         No blocking by iptables or firewalls; make sure port 7180 is open because it is the port used to access Cloudera Manager after installation (a short preparation sketch follows this list).

·         For RedHat/CentOS operating systems, make sure the /etc/sysconfig/network file on each system contains the hostname you have just set (or verified) for that system. (This does not apply to Debian/Ubuntu or SLES.)

·         Cloudera Manager, CDH, and managed services use several user accounts and groups to complete their tasks. The set of user accounts and groups varies according to which components you choose to install. Do not delete these accounts or groups and do not modify their permissions and rights. Ensure no existing systems obstruct the functioning of these accounts and groups.
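
A minimal preparation sketch covering these items on CentOS 6; the sysctl approach to disabling IPv6 is one common option, not the only one:

chkconfig iptables off && service iptables stop                  # no firewall blocking; port 7180 stays reachable
setenforce 0                                                     # SELinux not enforcing (see the config change above)
echo 'net.ipv6.conf.all.disable_ipv6 = 1' >> /etc/sysctl.conf
sysctl -p                                                        # disable IPv6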

 

Setup SSH (password-less login)


 

Step #1: Generate DSA Key Pair
Use ssh-keygen command as follows:


$ ssh-keygen -t dsa
Output:

[root@hadoop01 ~]# ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/root/.ssh/id_dsa):
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
63:c2:66:e8:b2:c4:49:cf:1c:6a:17:bb:4f:88:2f:b3 root@hadoop01.localdomain.com
[root@hadoop01 ~]#

Caution:
a) For password-less login between cluster nodes, press Enter at the passphrase prompt to leave it empty; otherwise you will be prompted for the passphrase on every connection.
b) The public key is written to ~/.ssh/id_dsa.pub.
c) The private key is written to ~/.ssh/id_dsa.
d) It is important that you never give out your private key.

Step #2: Set directory permission
Next, make sure you have the correct permissions on the .ssh directory:
[root@hadoop01 ~]# cd
[root@hadoop01 ~]# chmod 755 .ssh

Copy id_dsa.pub into the authorized_keys file:
[root@hadoop01 ~]# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
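
Optionally tighten permissions so sshd's StrictModes checks always pass; and if you later add a node without cloning, ssh-copy-id can push the key to it (the hadoop2 hostname is from the examples above):

[root@hadoop01 ~]# chmod 700 ~/.ssh
[root@hadoop01 ~]# chmod 600 ~/.ssh/authorized_keys
[root@hadoop01 ~]# ssh-copy-id root@hadoop2.demo.com   # only for nodes not created by cloning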



At this stage, shut down the system and clone it to create as many nodes as required. For this presentation I have used only two nodes.

Clones Customization

For every node, proceed with the following operations:

Modify the hostname of the server, change the following line in the file:

/etc/sysconfig/network

HOSTNAME=hadoop[n].demo.com

 

Where [n] = 1..2 (up to the number of nodes)

 

Modify the fixed IP address of the server, change the following line in the file:

 

/etc/sysconfig/network-scripts/ifcfg-eth0

IPADDR=192.168.1.11[0..1]

Where [0..1] gives 192.168.1.110 for hadoop1 and 192.168.1.111 for hadoop2, matching the /etc/hosts entries above.

 

Restart the networking services and reboot the server so that the above changes take effect:

$> /etc/init.d/network restart

$> init 6
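
On CentOS 6 clones it is also common to clear the cached MAC address binding (do this before the restart above); otherwise the cloned NIC may come up as eth1 instead of eth0. This assumes the HWADDR line is present in ifcfg-eth0, as in the dump earlier:

$> rm -f /etc/udev/rules.d/70-persistent-net.rules
$> sed -i '/^HWADDR/d' /etc/sysconfig/network-scripts/ifcfg-eth0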

Once the virtual machines are set up, log in to each machine as root and test the SSH connectivity, as shown below.
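
For example, this loop run from hadoop1 (and again from hadoop2) confirms password-less login in both directions:

$> for h in hadoop1.demo.com hadoop2.demo.com; do ssh root@$h hostname; done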

  

Install Cloudera Manager on hadoop1


Download and run the Cloudera Manager Installer, which greatly simplifies the rest of the installation and setup process.

$> curl -O http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin

$> chmod +x cloudera-manager-installer.bin

$> ./cloudera-manager-installer.bin
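
Once the installer finishes, a quick way to confirm that the Cloudera Manager Server is running and listening on its default port (netstat assumes the net-tools package is present):

$> sudo service cloudera-scm-server status
$> sudo netstat -tlnp | grep 7180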

 


Read the Cloudera Manager Readme and then press Enter to choose Next.



Read the Cloudera Manager License and then press Enter to choose Next. Use the arrow keys and press Enter to choose Yes to confirm you accept the license.


Read the Oracle Binary Code License Agreement and then press Enter to choose Next. Use the arrow keys and press Enter to choose Yes to confirm you accept the Oracle Binary Code License Agreement.

The Cloudera Manager installer begins installing the Oracle JDK and the Cloudera Manager repo files and then installs the packages. The installer also installs the Cloudera Manager Server.

Installation Log:



Installs the Postgres Package:


 

Note the complete URL provided for the Cloudera Manager Admin Console, including the port number, which is 7180 by default.

 

Start the Cloudera Manager Admin Console


The Cloudera Manager Admin Console enables you to use Cloudera Manager to configure, manage, and monitor Hadoop on your cluster. Before using the Cloudera Manager Admin Console, gather information about the server's URL and port.

The server URL takes the following form:

http://myhost.example.com:7180/

In a web browser, enter the URL, including the port, for the Cloudera Manager Server. The login screen for Cloudera Manager appears.

Log into Cloudera Manager. The default credentials are: Username: admin Password: admin

 

Use Cloudera Manager for Automated CDH Installation and Configuration


To use Cloudera Manager:

1. The first time you start the Cloudera Manager Admin Console, the install wizard starts up.
2. Select Install Cloudera Standard or Cloudera Enterprise Trial.
3. Click Continue.



 
Restart the service:
 
sudo service cloudera-scm-server restart
 

Important:

All hosts in the cluster must have some way to access installation files.

As the Cloudera Manager server restarts, the user interface indicates its progress, and presents the login page when the restart has completed.

Click Continue to proceed with the installation.



To enable Cloudera Manager to automatically discover the cluster hosts where you want to install CDH, enter the cluster hostnames or IP addresses. You can also specify hostname and IP address ranges, for example:
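
For the two-node cluster in this tutorial, either of the following range patterns would match both hosts:

hadoop[1-2].demo.com
192.168.1.[110-111]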



Click Search.

Cloudera Manager identifies the hosts on your cluster to allow you to configure them for CDH. If there are a large number of hosts on your cluster, wait a few moments to allow them to be discovered and shown in the wizard.

Verify that the number of hosts shown matches the number of hosts where you want to install CDH.



Click Continue.

Select the repository type you want to use for the installation.

  • To install using Parcels, select Parcels
  • To install using Packages, select Packages

Installing from parcels is recommended if they are available for the version you want to install.

 



Click Continue.

Provide credentials for authenticating with the hosts:


  1. Select root or enter the user name for an account that has password-less sudo permissions.


Click Continue to begin installing the Cloudera Manager Agent and Daemons on the cluster hosts.







 

When the Continue button appears at the bottom of the screen, the installation process is complete.

When you continue, the Host Inspector runs to validate the installation, and provides a summary of what it finds, including all the versions of the installed components. If the validation is successful, click Continue.


I skipped the Host Inspector and, after logging in to the portal, installed the services described below.

Choose the combination of services you want to start on your cluster: Core Hadoop, Real-Time Delivery (previously known as HBase Services), Real-Time Query (which includes HDFS, Hive, and Impala), All Services, or Custom Services.

  • Some services depend on others; for example, HBase requires HDFS and ZooKeeper.
  • Most of the combinations install MapReduce v1. Choose the Custom Services option to install MapReduce v2 (YARN) or use the Add Service functionality to add YARN after installation completes.

Services


Cloudera Home Page and Node Status:

 
 



 
Hue

Similarly, you can reach the Hue administration site at http://192.168.1.110:8888, where you will be able to access the different services that you have installed on the cluster.



 


At the top you will find the different services installed in the previous step.


Invoke the Query Editor.






Set up the workflow here.


Users Administration Page.

 

 

-x-

 

Monday, August 25, 2014

Search and Replace in all files within a directory recursively on Linux


To search recursively through directories, looking in all files for a particular string, and to replace every occurrence of that string with another, use:

find ./ -type f -exec sed -i 's/string1/string2/g' {} \;

where string1 is the search string and string2 is the replacement string. The g flag replaces every occurrence on a line, not just the first.
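
A slightly safer variant limits the search to particular files and keeps a backup copy of every file sed modifies (the *.conf pattern is only an example):

find ./ -type f -name '*.conf' -exec sed -i.bak 's/string1/string2/g' {} \;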



Oracle RAC 11g Release 2 log files

Oracle Clusterware log files


Component                                            Log file location
Cluster Ready Services Daemon (crsd) Log Files       $CRS_HOME/log/hostname/crsd
Cluster Synchronization Services (CSS)               $CRS_HOME/log/hostname/cssd
Event Manager (EVM) information generated by evmd    $CRS_HOME/log/hostname/evmd
Oracle RAC RACG                                      $CRS_HOME/log/hostname/racg
                                                     $ORACLE_HOME/log/hostname/racg

 
 
 Oracle RAC 11g Release 2 log files
 
Component                                                  Log file location
Clusterware alert log                                      $GRID_HOME/log/alert.log
Disk Monitor daemon                                        $GRID_HOME/log/diskmon
OCRDUMP, OCRCHECK, OCRCONFIG, CRSCTL                       $GRID_HOME/log/client
Cluster Time Synchronization Service                       $GRID_HOME/log/ctssd
Grid Interprocess Communication daemon                     $GRID_HOME/log/gipcd
Oracle High Availability Services daemon                   $GRID_HOME/log/ohasd
Cluster Ready Services daemon                              $GRID_HOME/log/crsd
Grid Plug and Play daemon                                  $GRID_HOME/log/gpnpd
Multicast Domain Name Service daemon                       $GRID_HOME/log/mdnsd
Event Manager daemon                                       $GRID_HOME/log/evmd
RAC RACG (only used if a pre-11.1 database is installed)   $GRID_HOME/log/racg
Cluster Synchronization Service daemon                     $GRID_HOME/log/cssd
Server Manager                                             $GRID_HOME/log/srvm
HA Service Daemon Agent                                    $GRID_HOME/log/agent/ohasd/oraagent_oracle11
HA Service Daemon CSS Agent                                $GRID_HOME/log/agent/ohasd/oracssdagent_root
HA Service Daemon ocssd Monitor Agent                      $GRID_HOME/log/agent/ohasd/oracssdmonitor_root
HA Service Daemon Oracle Root Agent                        $GRID_HOME/log/agent/ohasd/orarootagent_root
CRS Daemon Oracle Agent                                    $GRID_HOME/log/agent/crsd/oraagent_oracle11
CRS Daemon Oracle Root Agent                               $GRID_HOME/log/agent/crsd/orarootagent_root
Grid Naming Service daemon                                 $GRID_HOME/log/gnsd

 
 
Oracle database log file
 
Component           Log file location
Oracle database     $ORACLE_BASE/diag/rdbms/database_name/SID/
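
Because the database writes its logs under the ADR directory shown above, the adrci utility is a convenient way to read them; for example, to show the last 50 lines of the alert log for the current ADR home:

$ adrci exec="show alert -tail 50"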