Title: Introducing Data Science - Big Data, Machine Learning and more, using Python tools (2016).pdf
File Size: 14.6 MB
Total Pages: 322
Table of Contents
Front cover
brief contents
about this book
	Whom this book is for
	Code conventions and downloads
about the authors
	Author Online
about the cover illustration
1 Data science in a big data world
	1.1 Benefits and uses of data science and big data
	1.2 Facets of data
		1.2.1 Structured data
		1.2.2 Unstructured data
		1.2.3 Natural language
		1.2.4 Machine-generated data
		1.2.5 Graph-based or network data
		1.2.6 Audio, image, and video
		1.2.7 Streaming data
	1.3 The data science process
		1.3.1 Setting the research goal
		1.3.2 Retrieving data
		1.3.3 Data preparation
		1.3.4 Data exploration
		1.3.5 Data modeling or model building
		1.3.6 Presentation and automation
	1.4 The big data ecosystem and data science
		1.4.1 Distributed file systems
		1.4.2 Distributed programming framework
		1.4.3 Data integration framework
		1.4.4 Machine learning frameworks
		1.4.5 NoSQL databases
		1.4.6 Scheduling tools
		1.4.7 Benchmarking tools
		1.4.8 System deployment
		1.4.9 Service programming
		1.4.10 Security
	1.5 An introductory working example of Hadoop
	1.6 Summary
2 The data science process
	2.1 Overview of the data science process
		2.1.1 Don’t be a slave to the process
	2.2 Step 1: Defining research goals and creating a project charter
		2.2.1 Spend time understanding the goals and context of your research
		2.2.2 Create a project charter
	2.3 Step 2: Retrieving data
		2.3.1 Start with data stored within the company
		2.3.2 Don’t be afraid to shop around
		2.3.3 Do data quality checks now to prevent problems later
	2.4 Step 3: Cleansing, integrating, and transforming data
		2.4.1 Cleansing data
		2.4.2 Correct errors as early as possible
		2.4.3 Combining data from different data sources
		2.4.4 Transforming data
	2.5 Step 4: Exploratory data analysis
	2.6 Step 5: Build the models
		2.6.1 Model and variable selection
		2.6.2 Model execution
		2.6.3 Model diagnostics and model comparison
	2.7 Step 6: Presenting findings and building applications on top of them
	2.8 Summary
3 Machine learning
	3.1 What is machine learning and why should you care about it?
		3.1.1 Applications for machine learning in data science
		3.1.2 Where machine learning is used in the data science process
		3.1.3 Python tools used in machine learning
	3.2 The modeling process
		3.2.1 Engineering features and selecting a model
		3.2.2 Training your model
		3.2.3 Validating a model
		3.2.4 Predicting new observations
	3.3 Types of machine learning
		3.3.1 Supervised learning
		3.3.2 Unsupervised learning
	3.4 Semi-supervised learning
	3.5 Summary
4 Handling large data on a single computer
	4.1 The problems you face when handling large data
	4.2 General techniques for handling large volumes of data
		4.2.1 Choosing the right algorithm
		4.2.2 Choosing the right data structure
		4.2.3 Selecting the right tools
	4.3 General programming tips for dealing with large data sets
		4.3.1 Don’t reinvent the wheel
		4.3.2 Get the most out of your hardware
		4.3.3 Reduce your computing needs
	4.4 Case study 1: Predicting malicious URLs
		4.4.1 Step 1: Defining the research goal
		4.4.2 Step 2: Acquiring the URL data
		4.4.3 Step 4: Data exploration
		4.4.4 Step 5: Model building
	4.5 Case study 2: Building a recommender system inside a database
		4.5.1 Tools and techniques needed
		4.5.2 Step 1: Research question
		4.5.3 Step 3: Data preparation
		4.5.4 Step 5: Model building
		4.5.5 Step 6: Presentation and automation
	4.6 Summary
5 First steps in big data
	5.1 Distributing data storage and processing with frameworks
		5.1.1 Hadoop: a framework for storing and processing large data sets
		5.1.2 Spark: replacing MapReduce for better performance
	5.2 Case study: Assessing risk when loaning money
		5.2.1 Step 1: The research goal
		5.2.2 Step 2: Data retrieval
		5.2.3 Step 3: Data preparation
		5.2.4 Step 4: Data exploration & Step 6: Report building
	5.3 Summary
6 Join the NoSQL movement
	6.1 Introduction to NoSQL
		6.1.1 ACID: the core principle of relational databases
		6.1.2 CAP Theorem: the problem with DBs on many nodes
		6.1.3 The BASE principles of NoSQL databases
		6.1.4 NoSQL database types
	6.2 Case study: What disease is that?
		6.2.1 Step 1: Setting the research goal
		6.2.2 Steps 2 and 3: Data retrieval and preparation
		6.2.3 Step 4: Data exploration
		6.2.4 Step 3 revisited: Data preparation for disease profiling
		6.2.5 Step 4 revisited: Data exploration for disease profiling
		6.2.6 Step 6: Presentation and automation
	6.3 Summary
7 The rise of graph databases
	7.1 Introducing connected data and graph databases
		7.1.1 Why and when should I use a graph database?
	7.2 Introducing Neo4j: a graph database
		7.2.1 Cypher: a graph query language
	7.3 Connected data example: a recipe recommendation engine
		7.3.1 Step 1: Setting the research goal
		7.3.2 Step 2: Data retrieval
		7.3.3 Step 3: Data preparation
		7.3.4 Step 4: Data exploration
		7.3.5 Step 5: Data modeling
		7.3.6 Step 6: Presentation
	7.4 Summary
8 Text mining and text analytics
	8.1 Text mining in the real world
	8.2 Text mining techniques
		8.2.1 Bag of words
		8.2.2 Stemming and lemmatization
		8.2.3 Decision tree classifier
	8.3 Case study: Classifying Reddit posts
		8.3.1 Meet the Natural Language Toolkit
		8.3.2 Data science process overview and step 1: The research goal
		8.3.3 Step 2: Data retrieval
		8.3.4 Step 3: Data preparation
		8.3.5 Step 4: Data exploration
		8.3.6 Step 3 revisited: Data preparation adapted
		8.3.7 Step 5: Data analysis
		8.3.8 Step 6: Presentation and automation
	8.4 Summary
9 Data visualization to the end user
	9.1 Data visualization options
	9.2 Crossfilter, the JavaScript MapReduce library
		9.2.1 Setting up everything
		9.2.2 Unleashing Crossfilter to filter the medicine data set
	9.3 Creating an interactive dashboard with dc.js
	9.4 Dashboard development tools
	9.5 Summary
Appendix A—Setting up Elasticsearch
	A.1 Linux installation
	A.2 Windows installation
Appendix B—Setting up Neo4j
	B.1 Linux installation
	B.2 Windows installation
Appendix C—Installing MySQL server
	C.1 Windows installation
	C.2 Linux installation
Appendix D—Setting up Anaconda with a virtual environment
	D.1 Linux installation
	D.2 Windows installation
	D.3 Setting up the environment
Back cover
Document Text Contents
Page 1


Davy Cielen
Arno D. B. Meysman
Mohamed Ali

Big data, machine learning, and more, using Python tools

Page 2

Introducing Data Science

Page 161

140 CHAPTER 5 First steps in big data
In the next screen, choose Hive data and default as the user (figure 5.18). Select raw
as the table and select every column for import; then click the Load and Finish button
to complete this step.

After this step, it will take a few seconds to load the data in Qlik (figure 5.19).

Step 2: Create the report
Choose Edit the sheet to start building the report. This will add the report editor
(figure 5.20).

Figure 5.18 Hive interface raw data column overview

Figure 5.19 A confirmation that the data is loaded in Qlik

Page 162

141 Case study: Assessing risk when loaning money
Substep 1: Adding a selection filter to the report
The first thing we’ll add to the report is a selection box that shows us why each person
wants a loan. To achieve this, drop the title measure from the left asset panel onto the
report pane and give it a comfortable size and position (figure 5.21). Click on the
Fields table so you can drag and drop fields.

Figure 5.20 An editor screen for reports opens

Figure 5.21 Drag the title from the left Fields pane to the report pane.

Page 321

unsupervised machine learning 72–81
case study 73–76
comparing accuracy of original data set with latent variables 78–79
discerning simplified latent structure from data 73
grouping similar observations to gain insight from distribution of data 79–81

interpreting new variables

overview 72
URLs (uniform resource locators)
malicious, predicting

acquiring URL data

data exploration 105–106
defining research goals 104
model building 106–108

overview 174
User DSN, Hortonworks option 138
User nodes 199
user preferences 206
user-defined function 115
uses of data science and big data 2–3
utf-8 133


v argument 265
validating models 64–65
validation strategies 64
value variable 266
.valueAccessor() method 270
values, missing 34–35
value.stockAvg 266, 270

interpreting new variables 76–77

comparing accuracy of original data set with 78–79
finding in wine quality data set 73–76
reducing number of 41–42
selection of, building models and 48–49
turning into dummies 42–43

variablesInTable argument, CreateTable function 261

VB tag 228
VBD tag 228
VBG tag 228
VBN tag 228
VBP tag 228
VBZ tag 228
vertices 193
video 8
views, simulating joins using 39
VirtualBox tool 15, 125
visualizing data 253–274

creating interactive dashboard with dc.js 267–271
Crossfilter.js library 257–266
overview 257
setting up dc.js application 258–262
to filter medicine data set 262–266
dashboard development tools 272–274
options for 254–256
overview 253

VM software 125


Wald, Abraham 63
WAMP (Windows, Apache, MySQL, PHP) 258
WDT tag 228
webhdfs interface 130
Weka 101
whitespace, redundant 33

Wikibon 6
Wikipedia API 170, 186

Anaconda package installation 289
Elasticsearch installation on 277–279
MySQL server installation 284–285
Neo4j installation 282–283
Windows command window 279
Windows Hortonworks ODBC configuration dialog box 136

Windows ODBC manager 135
Wine Quality Data Set 73
Wolfram Alpha engine 222
word filtering

overview 237
stopping 227

word-tokenization 228, 239
word_cloud library 188
words, lowercasing 227
WP tag 228
WP$ tag 228
WRB tag 228
WWF (World Wildlife Fund) 3


x-axis 269
XAMPP (Cross Environment, Apache, MySQL, PHP, Perl) 258

xCharts 256
XOR operator 110


yum -y install python-pip command 126


zipfile 130

Page 322

Cielen ● Meysman ● Ali

Many companies need developers with data science skills
to work on projects ranging from social media marketing
to machine learning. Discovering what you need to
learn to begin a career as a data scientist can seem bewildering.
This book is designed to help you get started.

Introducing Data Science explains vital data science concepts
and teaches you how to accomplish the fundamental tasks that
occupy data scientists. You’ll explore data visualization, graph
databases, the use of NoSQL, and the data science process.
You’ll use the Python language and common Python libraries
as you experience firsthand the challenges of dealing with data
at scale. Discover how Python allows you to gain insights from
data sets so big that they need to be stored on multiple machines,
or from data moving so quickly that no single machine
can handle it. This book gives you hands-on experience with
the most popular Python data science libraries, Scikit-learn
and StatsModels. After reading this book, you’ll have the solid
foundation you need to start a career in data science.

What’s Inside
● Handling large data
● Introduction to machine learning
● Using Python to work with data
● Writing data science algorithms

Davy Cielen, Arno D. B. Meysman, and Mohamed Ali are the
founders and managing partners of Optimately and Maiton,
where they focus on developing data science projects and
solutions in various sectors.

To download their free eBook in PDF, ePub, and Kindle formats,
owners of this book should visit

$44.99 / Can $51.99 [INCLUDING eBOOK]

Introducing Data Science



“Read this book if you want to get a quick overview of data science,
with lots of examples to get you started!”
—Alvin Raj, Oracle

“The map that will help you navigate the data science oceans.”
—Marius Butuc, Shopify

“Covers the processes involved in data science from end to end…
A complete overview.”
—Heather Campbell, Kainos

“A must-read for anyone who wants to get into the data science world.”
—Hector Cuesta, Big Data Bootcamp

