DATAWORKS
DataWorks is an important platform as a service (PaaS) of Alibaba Cloud. It offers all-around services, including Data Integration, DataStudio, Data Map, Data Quality, and DataService Studio. In addition, it provides a one-stop data development and management console to help enterprises mine and explore data value.
DataWorks supports multiple computing and storage engines, including MaxCompute, E-MapReduce, Real-Time Compute, Machine Learning Platform for AI, Graph Compute, and Hologres. It also allows you to use custom computing and storage services. As a one-stop platform, DataWorks provides end-to-end big data services, artificial intelligence (AI) development, and data governance.
DataWorks simplifies data transmission, conversion, and integration. You can import data from different data stores, convert, analyze, and process the data, and then transmit the data to other data systems.
FEATURES
DataWorks is a cloud-hosted environment.
DataWorks supports multiple node types, such as Batch Sync node, Shell node, ODPS SQL node, and ODPS MR node. It analyzes and processes complex data based on the dependencies between nodes.
DataWorks provides visualized code development.
DataWorks supports monitoring and alerting.
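The dependency-based processing mentioned in the feature list can be pictured as a topological ordering of nodes. The following is a minimal sketch, not DataWorks code: the node names are hypothetical, and Python's standard graphlib module stands in for whatever the scheduling system does internally.

```python
from graphlib import TopologicalSorter

# Hypothetical nodes: each key maps to the set of nodes it depends on.
dependencies = {
    "sync_orders": set(),               # for example, a Batch Sync node
    "clean_orders": {"sync_orders"},    # for example, an ODPS SQL node
    "daily_report": {"clean_orders"},   # for example, an ODPS MR node
}


def run_order(deps):
    """Return the nodes in an order where each node follows its ancestors."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

Because every node here has a single ancestor, only one valid order exists: the sync node first, the report node last.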
LIMITS
DataWorks supports only Google Chrome 54 or later.
Currently, DataWorks only supports MaxCompute SQL operations.
THEORY
WORKFLOW:
Workflows are abstracted from business processes. They help you manage and develop code based on business demands and improve the efficiency of node management.
Workflows have the following features:
Nodes in a workflow are organized by type.
A hierarchical directory structure is supported. We recommend that you create a maximum of four levels of subfolders.
You can view and optimize each workflow from a business perspective.
You can deploy and manage each workflow as a whole.
You can view each workflow on a dashboard to efficiently develop code.
SOLUTION:
A solution contains one or more workflows.
Benefits:
A solution can contain multiple workflows.
A workflow can be added to multiple solutions.
Workspace members can collaboratively develop and manage all solutions in a workspace.
SQL SCRIPT TEMPLATE:
SQL script templates are general logic chunks abstracted from SQL scripts. They can be reused to enhance the efficiency of code development.
Each SQL script template involves one or more source tables. You can filter source table data, join source tables, and aggregate them to generate a result table based on the requirements of new business.
An SQL script template includes multiple input and output parameters.
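To illustrate the idea of a reusable SQL chunk with input and output parameters, here is a minimal sketch using Python's string.Template. The `${...}` placeholder syntax, the table names, and the column names are assumptions of this sketch and are not the syntax DataWorks uses for SQL script templates.

```python
from string import Template

# Hypothetical template: one input table, one output table, three parameters.
TOTAL_BY_KEY = Template(
    "INSERT OVERWRITE TABLE ${output_table} "
    "SELECT ${key}, SUM(${metric}) AS total "
    "FROM ${input_table} "
    "WHERE ${metric} > 0 "
    "GROUP BY ${key}"
)


def render(template, **params):
    """Fill in every parameter; substitute() raises KeyError if one is missing."""
    return template.substitute(**params)


sql = render(
    TOTAL_BY_KEY,
    input_table="orders",
    output_table="customer_totals",
    key="customer_id",
    metric="amount",
)
print(sql)
```

The same template can be rendered again with different tables and columns, which is the reuse that the text describes.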
NODE:
A node represents a data operation. For example:
A sync node is used to synchronize data from ApsaraDB for RDS to MaxCompute.
An ODPS SQL node is used to run MaxCompute SQL for data conversion.
Each node has zero or more input tables or datasets and generates one or more output tables or datasets.
Nodes are classified into node tasks, flow tasks, and inner nodes.
INSTANCE:
An instance is a snapshot of a node at a certain time point. It is generated when a node is scheduled by the scheduling system or triggered manually.
The instance contains information such as the time when the node is run, the running status of the node, and operational logs.
Assume that Node 1 is configured to run at 02:00 every day. At 23:30 every day, the scheduling system automatically generates an instance of Node 1 that will run at 02:00 the next day. At 02:00 the next day, if the scheduling system detects that all ancestor instances have run, it automatically runs the instance of Node 1.
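The Node 1 example can be sketched as follows. This is an illustration under stated assumptions, not the scheduling system's implementation: the `Instance` class, the status strings, and both functions are hypothetical, "run" is modeled as a "success" status, and only the generation date (not the 23:30 generation time) is modeled.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, time, timedelta


@dataclass
class Instance:
    node: str
    run_at: datetime
    status: str = "pending"  # hypothetical states: "pending" or "success"
    ancestors: list = field(default_factory=list)


def generate_instance(node, run_time, generated_on, ancestors=()):
    """Create the instance that is due on the day after generated_on."""
    run_at = datetime.combine(generated_on + timedelta(days=1), run_time)
    return Instance(node, run_at, ancestors=list(ancestors))


def try_run(instance, now):
    """Run the instance only if it is due and every ancestor has succeeded."""
    ready = all(a.status == "success" for a in instance.ancestors)
    if now >= instance.run_at and ready:
        instance.status = "success"
    return instance.status


# Instances generated on the evening of 2024-01-01 for the next day.
upstream = generate_instance("node_0", time(1, 0), date(2024, 1, 1))
node_1 = generate_instance("node_1", time(2, 0), date(2024, 1, 1), [upstream])
print(node_1.run_at)
```

Calling `try_run(node_1, ...)` at 02:00 leaves the instance pending until `upstream` has succeeded.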
COMMIT:
You can commit nodes and workflows from the development environment to the scheduling system.
The scheduling system runs the code specified in the committed nodes and workflows according to their configurations.
SCRIPT:
A script stores the code for data analysis. The code in a script can only be used for data query and analysis. It cannot be deployed to the scheduling system and cannot be scheduled.
RESOURCE AND FUNCTION:
Resources and functions are concepts in MaxCompute. For more information, see Resource and Function.
The DataWorks console allows you to manage resources and functions. Note that you cannot query resources and functions in DataWorks if they are uploaded through other services such as MaxCompute.
OUTPUT NAME:
Under an Alibaba Cloud account, each node has an output name that is used to connect to its descendant nodes.
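One way to picture how output names connect nodes: each node publishes one output name, and a descendant lists the output names it consumes. The sketch below is hypothetical; the node names, the output name format, and the dictionary layout are assumptions made for illustration.

```python
# Hypothetical nodes: each declares its own output name and the output
# names of the nodes it depends on.
nodes = {
    "sync_orders": {"output": "proj.sync_orders_out", "inputs": []},
    "clean_orders": {
        "output": "proj.clean_orders_out",
        "inputs": ["proj.sync_orders_out"],
    },
    "daily_report": {
        "output": "proj.daily_report_out",
        "inputs": ["proj.clean_orders_out"],
    },
}


def resolve_parents(nodes):
    """Map each node to its parent nodes by looking up output names."""
    producer = {}
    for name, spec in nodes.items():
        if spec["output"] in producer:
            # An output name can identify a node only if it is unique.
            raise ValueError(f"duplicate output name: {spec['output']}")
        producer[spec["output"]] = name
    return {
        name: [producer[out] for out in spec["inputs"]]
        for name, spec in nodes.items()
    }


parents = resolve_parents(nodes)
print(parents)
```

The uniqueness check reflects the assumption that an output name must point to exactly one node for the lookup to be unambiguous.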
SCENARIOS
LOG AND BIG DATA ANALYSIS:
Improved work efficiency
DataWorks allows you to synchronize log data to MaxCompute and use SQL statements to analyze and process the log data. This improves your work efficiency.
Enhanced storage efficiency
DataWorks reduces overall costs and improves the performance and stability of storage and computing services.
Simplified use of big data
DataWorks supports multiple open source MaxCompute plug-ins so that you can easily migrate data to the cloud.
REFINED BUSINESS OPERATIONS:
Improved business insights
With the help of MaxCompute, DataWorks allows you to refine business operations for millions of users.
Data-based business
DataWorks helps you effectively analyze and monitor business data to improve your business efficiency.
Quick response to business demands
DataWorks supports business data analysis so that you can quickly respond to new business demands.
DATA SECURITY MANAGEMENT:
Sensitive data identification
DataWorks can automatically identify sensitive data and use tags to classify the data based on custom rules.
Sensitive data de-identification and presentation
DataWorks allows you to set data de-identification rules to de-identify the sensitive information during data presentation.
Risk monitoring of sensitive data operations
DataWorks allows you to monitor data distribution, usage, and export in a visualized manner, and customize risk levels for auditing.
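Rule-based identification and de-identification, as described above, can be sketched with regular expressions. The rules, tag names, and masking policy below are assumptions made for illustration; they are not the rules that DataWorks ships with.

```python
import re

# Hypothetical custom rules: tag name -> pattern that identifies the data.
RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{11}\b"),
}


def identify(value):
    """Return the tags of every rule that matches the value."""
    return [tag for tag, pattern in RULES.items() if pattern.search(value)]


def _mask_match(match):
    """Keep the first two characters of a match and star out the rest."""
    text = match.group()
    return text[:2] + "*" * (len(text) - 2)


def mask(value):
    """De-identify every sensitive match before the value is presented."""
    for pattern in RULES.values():
        value = pattern.sub(_mask_match, value)
    return value


print(identify("call 13800001111"))
print(mask("call 13800001111"))
```

Masking at presentation time leaves the stored value unchanged, which matches the description of de-identification during data presentation.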
DATA DEVELOPMENT PROCESS
Data development is a process of generating, collecting, storing, analyzing, computing, extracting, presenting, and sharing data.
The data development process involves the following steps:
GENERATE DATA:
Each business system generates a large amount of structured data every day and stores the data in its own databases, such as MySQL, Oracle, and RDS databases.
COLLECT AND STORE DATA:
You can synchronize data from business systems to MaxCompute and use the powerful data storage and processing capabilities of MaxCompute to analyze the data.
The Data Integration service of DataWorks supports various connections. It allows you to synchronize data from business systems to MaxCompute based on the preset recurrence.
ANALYZE AND COMPUTE DATA:
After data synchronization, you can create ODPS SQL and ODPS MR nodes to process data in MaxCompute, and create other data analytics nodes to analyze and mine the data for value.
EXTRACT DATA:
You can export data processing and analysis results to business systems for further processing.
PRESENT AND SHARE DATA:
After data is extracted, you can present the big data processing and analysis results in multiple ways such as reports or a geographic information system (GIS). You can also share the results with others.
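The five steps can be walked through end to end in a small sketch. Everything here is a stand-in: two in-memory SQLite databases play the roles of a business database and MaxCompute, and the table and column names are hypothetical. In DataWorks, Data Integration and ODPS SQL nodes would do this work.

```python
import sqlite3

source = sqlite3.connect(":memory:")     # stands in for a business database
warehouse = sqlite3.connect(":memory:")  # stands in for MaxCompute

# Generate data: the business system writes structured records.
source.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
)

# Collect and store data: synchronize the records to the warehouse.
warehouse.execute("CREATE TABLE ods_orders (customer TEXT, amount REAL)")
rows = source.execute("SELECT customer, amount FROM orders").fetchall()
warehouse.executemany("INSERT INTO ods_orders VALUES (?, ?)", rows)

# Analyze and compute data: aggregate with SQL.
totals = warehouse.execute(
    "SELECT customer, SUM(amount) FROM ods_orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()

# Extract data: export the results back to the business system.
source.execute("CREATE TABLE customer_totals (customer TEXT, total REAL)")
source.executemany("INSERT INTO customer_totals VALUES (?, ?)", totals)

# Present and share data: a plain-text report.
for customer, total in totals:
    print(f"{customer}: {total:.2f}")
```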