Designing and Building Big Data Applications
This four-day course on designing and building Big Data applications prepares you to analyze and solve real-world problems using Apache Hadoop and associated tools in the Enterprise Data Hub (EDH).
You will work through the entire process of designing and building solutions, including ingesting data, determining the appropriate file format for storage, processing the stored data, and presenting the results to the end user in an easy-to-digest form. You will go beyond MapReduce, using additional elements of the EDH to develop converged applications that are highly relevant to the business.
- » This course is best suited to developers, engineers, architects, and data scientists who want to use Hadoop and related tools to solve real-world problems.
- » Hands-on experience with Apache Hadoop
- » Good knowledge of Java and basic familiarity with Linux are required
- » Experience with SQL is helpful
- » Creating a data set with Kite SDK
- » Developing custom Flume components for data ingestion
- » Managing a multi-stage workflow with Oozie
- » Analyzing data sets with Pig
- » Analyzing data with Crunch
- » Writing user-defined functions for Hive and Impala
- » Transforming data with Morphlines
- » Indexing data with Cloudera Search
- Application Architecture
- Scenario Explanation
- Understanding the Development Environment
- Identifying and Collecting Input Data
- Selecting Tools for Data Processing and Analysis
- Presenting Results to the User
- Defining and Using Data Sets
- Metadata Management
- What is Apache Avro?
- Avro Schemas
- Avro Schema Evolution
- Selecting a File Format
- Performance Considerations
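As a taste of the Avro material above, here is a minimal, stdlib-only sketch of the idea behind schema evolution: a reader schema may add a field with a default, and records written under the older schema remain readable. The record and field names are hypothetical, and this simulates Avro's resolution rule rather than using the Avro library itself.

```python
# Hypothetical writer (old) and reader (new) schemas, as plain dicts.
writer_schema = {
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "long"},
               {"name": "name", "type": "string"}],
}

# The reader's newer schema adds an optional field with a default value.
reader_schema = {
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "long"},
               {"name": "name", "type": "string"},
               {"name": "email", "type": ["null", "string"], "default": None}],
}

def resolve(record, reader):
    """Fill in defaults for fields the writer did not know about,
    mimicking Avro's schema-resolution rule for added fields."""
    out = {}
    for field in reader["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for {field['name']}")
    return out

old_record = {"id": 1, "name": "alice"}   # written with writer_schema
print(resolve(old_record, reader_schema)) # {'id': 1, 'name': 'alice', 'email': None}
```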
- Using the Kite SDK Data Module
- What is the Kite SDK?
- Fundamental Data Module Concepts
- Creating New Data Sets Using the Kite SDK
- Loading, Accessing, and Deleting a Data Set
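The Kite SDK's data module is a Java API; as a rough conceptual sketch only, the Python below mirrors its shape: a dataset descriptor bundles a schema and a storage format, and named operations create, load, and delete data sets (loosely modeled on Kite's `Datasets.create()` and `Datasets.delete()`). The URI, registry, and class here are hypothetical stand-ins.

```python
# Conceptual sketch only: not the Kite SDK API, just its shape.
class DatasetDescriptor:
    def __init__(self, schema, fmt="avro"):
        self.schema = schema      # an Avro schema (a dict here for brevity)
        self.format = fmt         # e.g. "avro" or "parquet"

registry = {}  # stands in for the metadata repository

def create_dataset(uri, descriptor):
    """Register a new, empty data set, like Kite's Datasets.create()."""
    if uri in registry:
        raise ValueError(f"dataset already exists: {uri}")
    registry[uri] = {"descriptor": descriptor, "records": []}
    return registry[uri]

def delete_dataset(uri):
    """Drop a data set and its metadata, like Kite's Datasets.delete()."""
    registry.pop(uri)

ds = create_dataset("dataset:hdfs:/data/events", DatasetDescriptor({"name": "Event"}))
ds["records"].append({"name": "click"})   # "loading" a record
delete_dataset("dataset:hdfs:/data/events")
print(len(registry))   # 0
```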
- Importing Relational Data with Apache Sqoop
- What is Apache Sqoop?
- Basic Imports
- Limiting Results
- Improving Sqoop's Performance
- Sqoop 2
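To make the Sqoop topics concrete, this illustrative snippet assembles a Sqoop 1 import command from Python. The JDBC URL, table, and column names are hypothetical; the flags (`--connect`, `--table`, `--where`, `--split-by`, `--num-mappers`) are standard Sqoop import options.

```python
def sqoop_import_cmd(jdbc_url, table, where=None, split_by=None, mappers=4):
    """Build a `sqoop import` argument list (not executed here)."""
    cmd = ["sqoop", "import", "--connect", jdbc_url, "--table", table,
           "--num-mappers", str(mappers)]
    if where:                 # limit results to matching rows
        cmd += ["--where", where]
    if split_by:              # column used to partition work across mappers
        cmd += ["--split-by", split_by]
    return cmd

cmd = sqoop_import_cmd("jdbc:mysql://db.example.com/sales", "orders",
                       where="order_date >= '2024-01-01'", split_by="id")
print(" ".join(cmd))
```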
- Capturing Data with Apache Flume
- What is Apache Flume?
- Basic Flume Architecture
- Flume Sources
- Flume Sinks
- Flume Configuration
- Logging Application Events to Hadoop
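The canonical shape of a Flume configuration is one agent wiring a source to a sink through a channel. The properties below follow the well-known netcat-to-logger example from the Flume user guide; the Python around it just parses the file format with the stdlib to highlight that three-part structure.

```python
FLUME_CONF = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

# Parse the key = value lines into a flat dict.
props = {}
for line in FLUME_CONF.strip().splitlines():
    if line and "=" in line:
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()

print(props["a1.sources.r1.type"])   # netcat
print(props["a1.sinks.k1.channel"])  # c1
```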
- Developing Custom Flume Components
- Flume Data Flow and Common Extension Points
- Custom Flume Sources
- Developing a Flume Pollable Source
- Developing a Flume Event-Driven Source
- Custom Flume Interceptors
- Developing a Header-Modifying Flume Interceptor
- Developing a Filtering Flume Interceptor
- Writing Avro Objects with a Custom Flume Interceptor
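Real Flume interceptors are written in Java against the Interceptor interface; this Python sketch only illustrates the two behaviors the exercises above cover: modifying an event's headers and filtering events out of the flow. The event shape (a headers dict plus a byte-string body) loosely matches Flume's event model; the fixed clock value is an assumption for reproducibility.

```python
def add_timestamp(event, now=1700000000):
    """Header-modifying interceptor: stamp each event (fixed clock here)."""
    event["headers"]["timestamp"] = str(now)
    return event

def drop_debug(event):
    """Filtering interceptor: return None to discard unwanted events."""
    if event["body"].startswith(b"DEBUG"):
        return None
    return event

def run_chain(events, interceptors):
    """Pass each event through the interceptor chain, dropping Nones."""
    out = []
    for ev in events:
        for icp in interceptors:
            ev = icp(ev)
            if ev is None:
                break
        if ev is not None:
            out.append(ev)
    return out

events = [{"headers": {}, "body": b"INFO started"},
          {"headers": {}, "body": b"DEBUG noisy detail"}]
kept = run_chain(events, [add_timestamp, drop_debug])
print(len(kept))                        # 1
print(kept[0]["headers"]["timestamp"])  # 1700000000
```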
- Managing Workflows with Apache Oozie
- The Need for Workflow Management
- What is Apache Oozie?
- Defining an Oozie Workflow
- Validation, Packaging, and Deployment
- Running and Tracking Workflows Using the CLI
- Hue UI for Oozie
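An Oozie workflow is defined as XML; the sketch below builds a minimal workflow skeleton with the stdlib XML API. The action body is deliberately omitted (a real workflow would contain, for example, a Pig or MapReduce action element), and the workflow name and namespace version are illustrative.

```python
import xml.etree.ElementTree as ET

NS = "uri:oozie:workflow:0.4"
wf = ET.Element("workflow-app", {"xmlns": NS, "name": "etl-wf"})
ET.SubElement(wf, "start", {"to": "first-action"})

# A real workflow would put a concrete action type (pig, map-reduce, ...)
# inside this element before the ok/error transitions.
action = ET.SubElement(wf, "action", {"name": "first-action"})
ET.SubElement(action, "ok", {"to": "end"})       # on success
ET.SubElement(action, "error", {"to": "fail"})   # on failure

kill = ET.SubElement(wf, "kill", {"name": "fail"})
ET.SubElement(kill, "message").text = "Workflow failed"
ET.SubElement(wf, "end", {"name": "end"})

xml_text = ET.tostring(wf, encoding="unicode")
print(xml_text)
```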
- Analyzing Data Sets with Pig
- What is Apache Pig?
- Pig's Features
- Basic Data Analysis with Pig
- Filtering and Sorting Data
- Commonly-Used Functions
- Processing Complex Data with Pig
- Techniques for Combining Data Sets
- Pig Troubleshooting and Optimization
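A Pig Latin script runs as a series of relational steps (LOAD, FILTER, GROUP, FOREACH, ORDER). This pure-Python analogue of a hypothetical page-view analysis shows the same pipeline shape, not Pig itself; the comments name the Pig operator each step corresponds to.

```python
from collections import Counter

# LOAD: (user, url, status) tuples standing in for a tab-delimited file.
views = [("alice", "/home", 200), ("bob", "/home", 404),
         ("alice", "/cart", 200), ("carol", "/home", 200)]

# FILTER views BY status == 200
ok = [v for v in views if v[2] == 200]

# GROUP ok BY url; FOREACH group GENERATE url, COUNT(*)
counts = Counter(url for _, url, _ in ok)

# ORDER ... BY count DESC
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)   # [('/home', 2), ('/cart', 1)]
```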
- Processing Data Pipelines with Apache Crunch
- What is Apache Crunch?
- Understanding the Crunch Pipeline
- Comparing Crunch to Java MapReduce
- Working with Crunch Projects
- Reading and Writing Data in Crunch
- Data Collection API Functions
- Utility Classes in the Crunch API
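Crunch builds pipelines of transformations over PCollections using DoFns. The toy class below mimics that composition model in Python, not the Crunch API: `parallel_do` plays the role of a DoFn that may emit zero or more outputs per input element.

```python
class PCollection:
    """Toy stand-in for Crunch's PCollection over an in-memory list."""
    def __init__(self, data):
        self._data = list(data)

    def parallel_do(self, fn):
        """fn maps one input element to a list of outputs, like a
        Crunch DoFn's process() emitting to an Emitter."""
        out = []
        for item in self._data:
            out.extend(fn(item))
        return PCollection(out)

    def materialize(self):
        return self._data

lines = PCollection(["big data", "data hub"])
words = lines.parallel_do(lambda line: line.split())   # tokenize
upper = words.parallel_do(lambda w: [w.upper()])       # 1-to-1 transform
print(upper.materialize())   # ['BIG', 'DATA', 'DATA', 'HUB']
```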
- Working with Tables in Apache Hive
- What is Apache Hive?
- Accessing Hive
- Basic Query Syntax
- Creating and Populating Hive Tables
- How Hive Reads Data
- Using the RegexSerDe in Hive
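Hive's RegexSerDe turns unstructured lines into table columns at read time, using a regex with one capture group per column. This stdlib sketch performs the same extraction on a hypothetical log format; the regex and column layout are illustrative.

```python
import re

# One capture group per Hive column: host, timestamp, request.
INPUT_REGEX = r'^(\S+) \[([^\]]+)\] "([^"]+)"$'

def parse_row(line):
    """Extract a row tuple, or None (RegexSerDe yields NULLs on no match)."""
    m = re.match(INPUT_REGEX, line)
    return m.groups() if m else None

row = parse_row('10.0.0.1 [01/Jan/2024:00:00:01] "GET /index.html"')
print(row)   # ('10.0.0.1', '01/Jan/2024:00:00:01', 'GET /index.html')
```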
- Developing User-Defined Functions
- What are User-Defined Functions?
- Implementing a User-Defined Function
- Deploying Custom Libraries in Hive
- Registering a User-Defined Function in Hive
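A Hive UDF is a Java class whose `evaluate()` method Hive calls once per row, and registration happens in HiveQL (shown as strings below, with a hypothetical jar path and class name). The Python function simply mirrors what a per-row `evaluate()` might do for a hypothetical string-normalizing UDF.

```python
REGISTRATION_HQL = [
    "ADD JAR /tmp/udfs.jar;",   # hypothetical jar path
    "CREATE TEMPORARY FUNCTION normalize AS 'com.example.NormalizeUDF';",
]

def evaluate(value):
    """What the hypothetical NormalizeUDF.evaluate() might do per row."""
    return value.strip().lower() if value is not None else None

rows = ["  Alice ", "BOB", None]
print([evaluate(v) for v in rows])   # ['alice', 'bob', None]
```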
- Executing Interactive Queries with Impala
- What is Impala?
- Comparing Hive to Impala
- Running Queries in Impala
- Support for User-Defined Functions
- Data and Metadata Management
- Understanding Cloudera Search
- What is Cloudera Search?
- Search Architecture
- Supported Document Formats
- Indexing Data with Cloudera Search
- Collection and Schema Management
- Indexing Data in Batch Mode
- Indexing Data in Near Real Time
- Presenting Results to Users
- Solr Query Syntax
- Building a Search UI with Hue
- Accessing Impala through JDBC
- Powering a Custom Web Application with Impala and Search
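Solr queries are expressed as URL parameters: `q` carries the main query in Lucene's field:term syntax, `fq` adds filter queries, and `rows` and `sort` shape the result page. Building the query string with the stdlib shows the syntax; the host, collection, and field names are hypothetical.

```python
from urllib.parse import urlencode

params = {
    "q": "title:hadoop AND body:ingest",  # Lucene field:term syntax
    "fq": "year:[2013 TO 2014]",          # filter query (range)
    "sort": "score desc",
    "rows": 10,
    "wt": "json",                         # response format
}
url = "http://search.example.com/solr/articles/select?" + urlencode(params)
print(url)
```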
We ensure your success by asking all students to take a FREE Skill Assessment test.
These short, instructor-written tests are an objective measure of your current skills and help us determine whether you will be able to meet your goals by attending this course at your current skill level. If we determine that you need additional preparation or training to gain the most value from this course, we will recommend cost-effective options you can use to get ready.
Our required skill-assessments ensure that:
- All students in the class are at a comparable skill level, so the class can run smoothly, without beginners slowing it down for everyone else.
- NetCom students enjoy one of the industry's highest success rates, including pass rates when a certification exam is involved.
- We stay committed to providing you with real value. Again, your success is paramount: we will register you only if you have the skills to succeed.
This assessment is for your benefit and is best taken without any preparation or reference materials, so that your skills can be objectively measured.
Take your FREE Skill Assessment test »
Jose Marcial Portilla has a BS and MS in Mechanical Engineering from Santa Clara University. He has a great skill set in analyzing data, specifically using Python and a variety of modules and libraries. He hopes to use his experience in teaching and data science to help other people learn the power of the Python programming language and its ability to analyze data, as well as to present the data in clear and beautiful visualizations. He is the creator of some of the most popular Python Udemy courses, including "Learning Python for Data Analysis and Visualization" and "The Complete Python Bootcamp". With almost 30,000 enrollments, Jose has been able to teach Python and its data science libraries to thousands of students. Jose is also a published author, having recently written "NumPy Succinctly" for Syncfusion's series of e-books.