Build batch & streaming pipelines to ingest & process data with HDInsight
As a Data Engineer, boost your productivity and deliver quality data quickly. With Data Collector, build, test, run, and maintain data flow pipelines that connect a variety of batch and streaming data sources and compute platforms.
Build adaptable pipelines with minimal coding and maximum flexibility.
- Easy-to-use graphical user interface for building data flow pipelines.
- Built-in transformations to cleanse and enrich data in flight.
- Rapid troubleshooting using data preview, snapshot capture, and replay functionality.
- 100% in-memory operation for high throughput and low latency.
Monitor pipeline performance and data quality
- Pipelines automatically detect and adapt to data drift as schema and semantics evolve.
- Customizable runtime metrics on dataflow operations and data fidelity.
- Real-time early warning of anomalies and outliers via data introspection, sampling, threshold rules and alerts.
It takes about 20 minutes for the HDInsight cluster to become available. Once the cluster is deployed, the Data Collector service starts on the edge node.
To access the Data Collector UI:
1. Open the HDInsight cluster in the Azure portal, and then click Applications.
2. Locate the StreamSets Data Collector for HDInsight Cloud application, and then click Portal in the URI column.
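Once you have the Data Collector URL from the URI column, you can verify from the command line that the UI is reachable. The snippet below is a minimal sketch: the cluster name, application name, and hostname pattern are assumptions for illustration only; always use the exact URL shown in the portal's URI column.

```shell
# Hypothetical cluster and application names -- replace with your own values.
CLUSTER_NAME="mycluster"
APP_NAME="sdc"

# Assumed hostname pattern for an HDInsight application endpoint;
# the authoritative URL is the one shown in the URI column in the portal.
SDC_URL="https://${CLUSTER_NAME}-${APP_NAME}.apps.azurehdinsight.net"
echo "Data Collector UI: ${SDC_URL}"

# After deployment, uncomment to check that the endpoint responds:
# curl -sS -o /dev/null -w "%{http_code}\n" "$SDC_URL"
```

Until the cluster finishes deploying (about 20 minutes), the endpoint will not respond, so run the check only after the application shows as installed in the portal.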
For more information, please refer to the documentation.