Tuesday, 2 August 2016


Machine Learning: Google and Microsoft Want Every Company to Scrutinize You with AI

When some patients of Dartmouth-Hitchcock Medical Center in New Hampshire step on their bathroom scale at home, Microsoft’s computers know about it. The corporation’s machines also get blood pressure readings taken at home. And they can even listen to calls between nurses and patients to gauge a person’s emotional state. Microsoft’s artificial intelligence software parses that data to try and warn patients and staff of emerging health problems before any human notices.
The hospital is previewing both the future of health care and of Microsoft’s business. It’s using a suite of new “cognitive” services recently added to Microsoft’s cloud computing service, called Azure. The company says renting out its machine-learning technology will unlock new profits, and enable companies of all kinds to subject their data—and customers—to artificial-intelligence techniques previously limited to computing giants.
Illustration by Oscar Bolton Green
“Customers are going to mature from classic cloud services to services that use elements of machine learning and AI,” says Herain Oberoi, director of product management at Microsoft, who oversees the company’s cloud machine-learning services. “Every company I talk with has someone extremely senior tasked with thinking about how to make this technology work for them.”
Microsoft’s competitors Google, IBM, and Amazon are making the same bet. Google announced in June that it had invented a new kind of chip to accelerate machine-learning software and make its cloud services more competitive. The company lags Amazon and Microsoft in the cloud market, and CEO Sundar Pichai has said machine-learning services provide a way for Google to differentiate itself. Amazon’s cloud division, Amazon Web Services, launched its first machine-learning cloud services last year, and in June the group’s head, Andy Jassy, pledged to expand them significantly in the coming months.
Amazon and its largest competitors stepped up their investments in machine-learning technology in recent years after breakthroughs in software that can be trained to do tasks such as interpret photos or speech (see “10 Breakthrough Technologies 2013: Deep Learning”).
Some of the first consumer products to take advantage of those breakthroughs were Amazon’s Alexa voice-operated home assistant and Google’s new Photos service, which understands the content of images and has more than 200 million users. Adding machine learning to the cloud services that corporations already use to outsource tasks such as data storage and analysis is seen as another way to extract money from the technology and enhance the very lucrative market. IDC estimates that corporations spent almost $70 billion with cloud providers last year, and predicts that will double before the end of the decade.
Rob Craft, who leads product management for Google's cloud machine-learning offerings, says that most companies are in a position to benefit from machine learning right away because they have a lot of data on hand about their operations, business, and customers. “Our goal is to help them have more direct value from that data,” he says.
The most straightforward of the new services offered by Google and others do things like describe the content of images, transcribe audio files such as phone calls, extract key terms from text, or translate text between languages. Although seen as lagging behind Google in machine-learning technology, Microsoft and IBM have so far rolled out the broadest range of such services, known as APIs.
Microsoft has an API that tries to decipher facial expressions, for example. IBM has one that assesses the personality of the author of text such as social media posts. Marketing company Influential uses it to help brands such as Corona and Red Bull identify the most useful social media users for promotional efforts. Different APIs can be combined. For example, a company could set up a system that spots its logo in social media images, notes the facial expression of any people in the photo, and extracts key terms from any accompanying text.
Many key software components needed to build the kind of machine-learning systems that Google and others hope will be so valuable are free (see “Facebook Joins Stampede of Tech Giants Giving Away AI Technology”). But Jimoh Ovbiagele, cofounder and chief technology officer at startup ROSS Intelligence, which provides software that speeds up legal research to major law firms, says that the time and expense of building and operating a top-notch machine-learning system means many companies are better off renting the technology.
“It makes sense to stand on the shoulders of giants,” says Ovbiagele. ROSS’s ability to understand legal questions is built on IBM’s suite of language processing technology, some of which originated with the Watson computer that beat two Jeopardy! champions in 2011.
Chris Curran, chief technologist with PwC, says most large corporations are still far from ready to spend significantly on machine-learning services, though. He estimates about three quarters are in “watch and learn” mode, waiting to see what these new capabilities offer.
And while the new services from Microsoft and others make it easy for non-technology companies like Dartmouth-Hitchcock Medical Center to use preprogrammed machine-learning systems, the technology is most valuable when customized for an organization’s specific needs, says Curran. Google and Microsoft’s image APIs are good at general assessments, such as whether a photo contains a cat or a skyscraper, for example. But a food manufacturer would get more value from a vision system able to spot specific defects in items on its production line.
All the cloud providers either already offer or have promised ways for customers to train algorithms on their own data, for their own problems. But creating customized artificial intelligence software can only be made so easy, says Curran. “You need to have the right people and expertise, and those are in short supply,” he says.


Full news: https://www.technologyreview.com/s/602037/google-and-microsoft-want-every-company-to-scrutinize-you-with-ai/

Friday, 22 July 2016


Apache Hive : Hive Table Creation and Data Loading Example on traffic_violation Data


create table traffic_violation(
date_of_stop String, time_of_stop timestamp, agency String, subagency String, description String, location String, latitude String, longitude float, accident String, belts String, personal String, property_damage String, fatal String, commercial String, hazmat String, commercial_vehicle String, alcohol String, work_zone String,
state String, vehicle_type String, year int, make String, model String, color String, violation_type String, charge bigint, article String, contributed_to_accident String, race String, gender String, driver_city String, driver_state String, dl_state String, arrest_type String, geolocation String)
row format delimited
fields terminated by ','
stored as textfile;

Load the data into the table:

 

load data local inpath '/root/manish/Traffic_Violations.csv' overwrite into table traffic_violation;
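
As a quick sanity check after the load (assuming the CSV parsed cleanly with the comma delimiter), you can count the rows and preview a few records:

select count(*) from traffic_violation;
select date_of_stop, description, violation_type from traffic_violation limit 5;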

Download the traffic_violation data file.


Apache Hadoop : Hive Table Creation and Data Loading Example on Crime Data


create table crimedata(id int, case_number String, date String, block String,
iucr int, primary_type String, description String, location_description String,
arrest String, domestic String, beat int, district int, ward int,
community_area int, fbi_code int, x_coordinate bigint, y_coordinate bigint,
year int, update_on timestamp, latitude float, longitude float, location float)

row format delimited
fields terminated by ','
stored as textfile;




load data local inpath '/root/data/crimes_-_2001_to_present.csv' overwrite into table crimedata;
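
Once the data is loaded, a simple aggregate over the columns defined above shows the most common crime types:

select primary_type, count(*) as total
from crimedata
group by primary_type
order by total desc
limit 10;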



Download the Crime Data file.




Wednesday, 20 July 2016


SoftServe’s relationship with Cloudera will provide customers with real-time big data analytics, high performance in classical structured data analysis, more accurate predictive analytics, and business intelligence and visualisation


SoftServe has joined the Cloudera Connect Partner Program. Cloudera, which is offering a unified platform for big data built around open source Apache Hadoop, is working with SoftServe to help organisations gain a competitive advantage from their data by providing them with data acceleration capabilities for real-time decision-making through professional services.

Image: Tim Stevens of Cloudera (via channelbiz.co.uk)

SoftServe’s new relationship with Cloudera will provide customers with real-time big data analytics, high performance in classical structured data analysis, more accurate predictive analytics, business intelligence and visualisation, and network configuration optimisation.

“With their unique strengths in professional services, we are pleased to welcome SoftServe to the Cloudera Connect Partner program,” said Tim Stevens (pictured), vice president for corporate and business development at Cloudera. “SoftServe’s professional team of experts turn data into insight and advantage, so now our mutual customers are able to receive end-to-end big data solutions that deliver effective and timely business results.”

“Cloudera is the definitive leader in emerging big data technology for the enterprise, so our two companies working together is a perfect fit for SoftServe’s professional services organisation,” said Neil Fox, EVP and CTO at SoftServe.

SoftServe has longstanding expertise in the various technologies in Cloudera’s big data ecosystem, including Hadoop, HBase and Flume, and an increasing speciality in newer technologies such as Spark. This, combined with depth in analytical tools and languages, including R, Python and Scala, enables SoftServe to “deliver innovative big data solutions”, said Cloudera.


Read more at http://www.channelbiz.co.uk/2015/04/24/cloudera-adds-softserve-pro-services-to-hadoop-platform/#Qsfm8IQ4TyET9HZG.99

Apache Hadoop : Sqoop Script for Importing Data from RDBMS to HDFS and RDBMS to Hive

Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases. This is a brief tutorial that explains how to use Sqoop in the Hadoop ecosystem.


Sqoop: “SQL to Hadoop and Hadoop to SQL”


Sqoop Import

 

 

 

The sqoop import command imports a table from an RDBMS into HDFS. Each row of the table is stored as a separate record in HDFS. Records can be stored as text files, or in binary representation as Avro or SequenceFiles.

Importing an RDBMS table into HDFS

Syntax:
$ sqoop import --connect <jdbc-url> --table <table-name> --username <user> --password <password> --target-dir <hdfs-dir> -m 1

 

--connect       JDBC URL of the source database (e.g. jdbc:mysql://localhost:3306/test)
--table         Name of the source table to import (e.g. sqooptest)
--username      Username used to connect to the database (e.g. root)
--password      Password of the connecting user (e.g. 12345)
--target-dir    HDFS directory the data is imported into (e.g. /output)
-m 1            Number of parallel map tasks (a single mapper here)
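
Putting those options together with the example values above (database test, table sqooptest, target directory /output), a plain RDBMS-to-HDFS import looks like this:

sqoop import --connect jdbc:mysql://localhost:3306/test --table sqooptest --username root --password 12345 --target-dir /output -m 1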
To import a table directly into Hive instead of a plain HDFS directory, add the --hive-import and --hive-table options:

sqoop import --connect jdbc:mysql://localhost:3306/ecafe --table mm01_billing --username root --hive-import --hive-table mm01_billing --target-dir /apps/hive/warehouse/mm01_billing -m 1

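Once the import finishes, a quick way to confirm that the Hive table was created and received rows (using the table name from the command above) is:

hive -e "show tables;"
hive -e "select count(*) from mm01_billing;"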


About HTML

HTML stands for Hyper Text Markup Language, the most widely used language on the Web for developing web pages.

HTML was created by Tim Berners-Lee in late 1991, but "HTML 2.0" was the first standard HTML specification, published in 1995. HTML 4.01, a major version of HTML, was published in late 1999. Although HTML 4.01 is still widely used, the current version is HTML5, an extension of HTML 4.01 published in 2012.
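
For a sense of what the markup looks like, here is a minimal HTML page (a generic illustration, not taken from any particular site):

<!DOCTYPE html>
<html>
<head>
<title>Hello HTML</title>
</head>
<body>
<h1>My first page</h1>
<p>This is a paragraph of text.</p>
</body>
</html>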

 

Prerequisites


Before proceeding with this tutorial, you should have a basic working knowledge of the Windows or Linux operating system. Additionally, you must be familiar with:

  • Experience with any text editor such as Notepad, Notepad++, or EditPlus.
  • How to create directories and files on your computer.
  • How to navigate through different directories.
  • How to type content into a file and save it on your computer.
  • Understanding of images in different formats, such as JPEG and PNG.
 

About Hadoop


Hadoop Common: These are the Java libraries and utilities required by other Hadoop modules. These libraries provide file system and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
Hadoop YARN: This is a framework for job scheduling and cluster resource management.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: This is a YARN-based system for the parallel processing of large data sets. Hadoop MapReduce is a software framework for easily writing applications that process large amounts of data in parallel on big clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The term MapReduce actually refers to the following two different tasks that Hadoop programs perform:
The Map Task: This is the first task, which takes input data and converts it into a set of data where individual elements are broken down into tuples (key/value pairs).
The Reduce Task: This task takes the output of a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task. Typically both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing any failed tasks.
The MapReduce framework consists of a single master, the JobTracker, and one slave TaskTracker per cluster node. The master is responsible for resource management, tracking resource consumption and availability, and scheduling the job's component tasks on the slaves, monitoring them and re-executing failed tasks. The slave TaskTrackers execute the tasks as directed by the master and periodically provide task-status information to the master. The JobTracker is a single point of failure for the Hadoop MapReduce service: if the JobTracker goes down, all running jobs are halted.
Hadoop Distributed File System
Hadoop can work directly with any mountable distributed file system such as Local FS, HFTP FS, S3 FS, and others, but the most common file system used by Hadoop is the Hadoop Distributed File System (HDFS). HDFS is based on the Google File System (GFS) and provides a distributed file system that is designed to run on large clusters (thousands of computers) of small machines in a reliable, fault-tolerant manner. HDFS uses a master/slave architecture, where the master consists of a single NameNode that manages the file system metadata, and one or more slave DataNodes store the actual data. A file in an HDFS namespace is split into several blocks, and those blocks are stored in a set of DataNodes. The NameNode determines the mapping of blocks to DataNodes. The DataNodes take care of read and write operations with the file system; they also handle block creation, deletion, and replication based on instructions given by the NameNode.
HDFS provides a shell like any other file system, and a list of commands is available to interact with the file system. These shell commands will be covered in a separate chapter along with appropriate examples.
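
For a flavour of that shell, a few common HDFS commands look like this (the paths are only illustrative):

hdfs dfs -mkdir /user/demo
hdfs dfs -put localfile.txt /user/demo
hdfs dfs -ls /user/demo
hdfs dfs -cat /user/demo/localfile.txt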
How Does Hadoop Work?
Stage 1: A user/application submits a job to Hadoop (via a Hadoop job client) for the required processing by specifying the following items: the location of the input and output files in the distributed file system; the Java classes, in the form of a jar file, containing the implementation of the map and reduce functions; and the job configuration, set through different parameters specific to the job.
Stage 2: The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.
Stage 3: The TaskTrackers on different nodes execute the tasks as per the MapReduce implementation, and the output of the reduce function is stored in output files on the file system.
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, in turn utilizing the underlying parallelism of the CPU cores. Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer. Servers can be added to or removed from the cluster dynamically, and Hadoop continues to operate without interruption. Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java based.


Monday, 18 July 2016


How To Install Apache MySQL PHP (LAMP Stack) on Ubuntu 16.04

LAMP (Linux, Apache, MySQL and PHP) is the most popular environment for PHP website development and hosting. Linux is the operating system, Apache is the popular web server developed by the Apache Foundation, MySQL is a relational database management system used for storing data, and PHP is the development language.
This article will help you install Apache 2.4, MySQL 5.7 and PHP 7.0 on Ubuntu 16.04 LTS systems.

Step 1 – Install PHP

PHP 7.0 is the default PHP package in the Ubuntu 16.04 repositories. Simply use the following commands to update the apt cache and install the PHP packages on your system.
$ sudo apt update
$ sudo apt install -y php
Verify the installed PHP version using the following command.
$ php -v
PHP 7.0.4-7ubuntu2 (cli) ( NTS )
Copyright (c) 1997-2016 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2016 Zend Technologies
    with Zend OPcache v7.0.6-dev, Copyright (c) 1999-2016, by Zend Technologies

Step 2 – Install Apache2

After installing PHP on your system, let’s start the installation of Apache2. You also need to install the libapache2-mod-php module so that PHP works with Apache2.
$ sudo apt install apache2 libapache2-mod-php
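If you want to confirm which Apache version was installed, you can run:
$ apache2 -v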

Step 3 – Install MySQL

Finally, install the mysql-server package for the MySQL database. Also install the php-mysql package to enable MySQL support in PHP. Use the following command to install both.
$ sudo apt install mysql-server php-mysql
The installer will prompt for a root password; this password is for your MySQL root user. After installing MySQL, execute the following command for the initial configuration of the MySQL server. You will see that the script prompts for more settings than earlier MySQL versions, such as the password validation policy.
$ sudo mysql_secure_installation

Step 4 – Restart Apache2, MySQL Services

After installing all services on your system, start all required services.
$ sudo systemctl restart apache2.service
$ sudo systemctl restart mysql.service

Step 5 – Open Access in Firewall

Use one of the following commands to open port 80 for public access to the web server, depending on your firewall.

Iptables Users:

$ sudo iptables -A INPUT -m state --state NEW -p tcp --dport 80 -j ACCEPT

UFW Users:

$ sudo ufw allow 80/tcp

Step 6 – Test Setup

After completing the setup, create an info.php file in the website document root with the following content, then open http://your-server-ip/info.php in a browser to confirm that PHP is working with Apache.
<?php
 phpinfo();
?>
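
One way to create that file from the command line (assuming Apache's default document root of /var/www/html on Ubuntu) is:
$ echo "<?php phpinfo(); ?>" | sudo tee /var/www/html/info.php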

Saturday, 16 July 2016


Apache Hadoop : Hive Partitioning and Bucketing Example on Twitter Data


Overview on Hive Partitioning :

Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of partition columns such as date, city, and department. Using partitions, it is easy to query only a portion of the data.
Overview on Hive Bucketing :
A Hive partition can be further subdivided into clusters, or buckets. Bucketing is another technique for decomposing data into more manageable, roughly equal-sized parts.


The dataset :

Tweet ID
Username
Text
Created Date
Profile Location
Favc
Retweet
Retweet Count
Count of Followers

Script :

Create a table for the Twitter data:



create table twitter04(tweetId BIGINT, username STRING,txt STRING,CreatedAt STRING,
profileLocation STRING,favc BIGINT,retweet STRING,retcount BIGINT,followerscount BIGINT)
row format delimited
fields terminated by '\t'
stored as textfile;



Load data from the input file (Twitterdata.txt) into the table (twitter04):

Load data local inpath '/root/manish/Twitterdata.txt' overwrite into table twitter04;

Note: if the data is already in HDFS, omit the "local" keyword.
Create a partitioned table:


Create table partitiontwitter(tweetId BIGINT, username STRING, txt STRING, favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
partitioned by(CreatedAt String,profileLocation STRING)
row format delimited
fields terminated by '\t'
stored as textfile;

Load data from the twitter04 table into the partitioned table:


insert overwrite table partitiontwitter
partition (CreatedAt="26 04:50:56 UTC 2014",profileLocation="Chicago")
select tweetId,username,txt,favc,retweet,retcount,followerscount
from twitter04 where profileLocation='Chicago' limit 50;
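
To confirm that the partition was created, you can list the table's partitions:

show partitions partitiontwitter;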



Create a bucketed table:

create table buckettwitter(tweetId BIGINT, username STRING, txt STRING,
CreatedAt STRING, favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
partitioned by(profileLocation STRING)
clustered by(tweetId) into 2 buckets
row format delimited
fields terminated by '\t'
stored as textfile;


Enable bucketed inserts before loading the data:

set hive.enforce.bucketing = true;


Load data from the twitter04 table into the bucketed table:


insert overwrite table buckettwitter partition(profileLocation="Chicago")
select tweetId, username, txt, CreatedAt, favc, retweet, retcount, followerscount
from twitter04
where profileLocation = 'Chicago' limit 100;
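
One benefit of bucketing is cheap sampling; for example, to scan only the first of the two buckets:

select tweetId, username from buckettwitter tablesample(bucket 1 out of 2 on tweetId);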