Hive Partitioning and Bucketing Example on Twitter Data
Overview of Hive Partitioning :
Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of the partition columns, such as date, city, or department. With partitions, it is easy to query only the relevant portion of the data.
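For example, once a partitioned table exists, a query that filters on a partition column only reads the matching partition directories instead of scanning the whole table. A minimal sketch, assuming the partitiontwitter table and the Chicago partition created later in this article:

select tweetId, username, txt
from partitiontwitter
where profileLocation = 'Chicago';   -- only the Chicago partition directory is read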
Overview of Hive Bucketing :
A Hive partition can be further subdivided into clusters, or buckets. Bucketing is another technique for decomposing the data into more manageable, roughly equal-sized parts, based on the hash of a chosen column.
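Because each partition is split into a fixed number of files hashed on the clustering column, Hive can also sample a single bucket instead of reading everything. A minimal sketch, assuming the buckettwitter table created later in this article:

select username, retcount
from buckettwitter tablesample(bucket 1 out of 2 on tweetId)   -- reads roughly half the rows
where profileLocation = 'Chicago';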
The dataset :
Tweet ID
Username
Text
Created Date
Profile Location
Favc
Retweet
Retweet Count
Count of Followers
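Each record in Twitterdata.txt is then a single tab-separated line containing these nine fields in order; the values below are purely hypothetical, shown only to illustrate the layout:

466123402344560000	john_doe	Loving the weather today	26 04:50:56 UTC 2014	Chicago	12	false	3	540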
Script :
Create table with Twitter data :
create table twitter04(tweetId BIGINT, username STRING, txt STRING, CreatedAt STRING,
profileLocation STRING, favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
row format delimited
fields terminated by '\t'
stored as textfile;
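To confirm the table was created with the expected columns, the schema can be inspected (this check is not part of the original script):

describe twitter04;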
Load data from the input file (Twitterdata.txt) into the twitter04 table :
Load data local inpath '/root/manish/Twitterdata.txt' overwrite into table twitter04;
Note : if you are loading the file from HDFS rather than the local filesystem, omit the LOCAL keyword.
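A quick sanity check that the load parsed the tab-delimited file as expected:

select count(*) from twitter04;
select * from twitter04 limit 5;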
Create table with partitioning :
Create table partitiontwitter(tweetId BIGINT, username STRING, txt STRING,
favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
partitioned by(CreatedAt String,profileLocation STRING)
row format delimited
fields terminated by '\t'
stored as textfile;
Load data from the twitter04 table into the partitioned table :
insert overwrite table partitiontwitter
partition (CreatedAt="26 04:50:56 UTC 2014",profileLocation="Chicago")
select tweetId,username,txt,favc,retweet,retcount,followerscount
from twitter04 where profileLocation='Chicago' limit 50;
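To verify that the static partition was created, list the table's partitions. As an alternative to repeating this insert once per date and city, a dynamic-partition insert lets Hive derive the partition values from the data itself; the statements below are a sketch of that approach, not part of the original script:

show partitions partitiontwitter;

set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
insert overwrite table partitiontwitter
partition (CreatedAt, profileLocation)
select tweetId, username, txt, favc, retweet, retcount, followerscount, CreatedAt, profileLocation
from twitter04;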
Create table with bucketing :
create table buckettwitter(tweetId BIGINT, username STRING, txt STRING, CreatedAt STRING,
favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
partitioned by(profileLocation STRING)
clustered by(tweetId) into 2 buckets
row format delimited
fields terminated by '\t'
stored as textfile;
set hive.enforce.bucketing = true;
Load data from the twitter04 table into the bucketed table :
insert overwrite table buckettwitter partition(profileLocation="Chicago")
select tweetId, username, txt, CreatedAt, favc, retweet, retcount, followerscount
from twitter04
where profileLocation = 'Chicago' limit 100;
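After the insert, the Chicago partition should contain two bucket files, one per hash bucket of tweetId. The checks below are a sketch; the warehouse path assumes Hive's default location and may differ in your setup:

dfs -ls /user/hive/warehouse/buckettwitter/profilelocation=Chicago;

select count(*) from buckettwitter tablesample(bucket 1 out of 2 on tweetId)
where profileLocation = 'Chicago';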