This section provides information
about restarting load plans after a failure. It includes the following
sections:
- Overview of Load Plan Restartability
- About Restartability Grain
- Restarting Load Plans
- Troubleshooting Load Plans
- Alternate Options for Restarting Load Plans
- Related Features and Considerations
When you run ETL to load data from a
source system into the Oracle Business Analytics Warehouse (OBAW), you may
need to restart the ETL load after a failure. This section details the cases
in which this happens, explains the options available for restart, and
describes the implications of using each of those options.
Examples of circumstances and
reasons for load plan failure include:
- Problem with access either to source or target database due to network failure or expired or otherwise incorrect user names and passwords.
- Failure of ODI agent.
- Problem with space or storage. You can connect to the source or target database, but the query fails to run due to a lack of temp space, disk space, and so on. For files, the failure could be due to inadequate space where the file needs to be placed.
- Problem with data, for example incorrect data with lengths larger than the target column can hold, or null values in Not Null columns.
Restarting the entire load plan after such a failure would require
inefficient re-runs of all ETL tasks. To avoid this, once the cause of failure
has been diagnosed and resolved, restart the load from the point in its
execution where it failed.
When you restart a load plan after a
failure, it may not resume from the exact point of failure, depending on where
the failure occurred and on dependencies between load plan steps. The point of
restartability is that the end result of the load plan execution is the same
regardless of any load plan failure.
The following example describes one
such dependency-driven requirement for re-running a step which has already
completed: In a load plan with two steps, the first step truncates the table
and the second inserts records into the table, intermittently committing the
records. The load plan is run and fails at the second step due to a space
issue. After the issue is resolved, restarting the load plan from the second
step would be incorrect because the target has some inserted rows. Restart
should instead begin with the first step so that the target table is truncated
again and newly inserted data does not cause any duplicates.
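The two-step example above can be simulated with a short sketch. It uses
Python's built-in sqlite3 module with an in-memory table. The table name
target, the run_load function, and the simulated failure are illustrative
assumptions and are not part of ODI or of any generated load plan.

```python
import sqlite3


def run_load(conn, rows, start_step, fail_after=None):
    """Two-step load: step 1 truncates, step 2 inserts with intermittent commits.

    start_step is 1 or 2. fail_after simulates a failure once that many
    rows have been inserted and committed.
    """
    if start_step <= 1:
        conn.execute("DELETE FROM target")  # step 1: truncate the table
        conn.commit()
    for count, row in enumerate(rows, start=1):  # step 2: insert records
        conn.execute("INSERT INTO target (id) VALUES (?)", (row,))
        conn.commit()  # committed rows survive a later failure
        if fail_after is not None and count == fail_after:
            raise RuntimeError("simulated space issue")


def row_count_after_restart(restart_step):
    """Fail the first run after two rows, then restart at restart_step."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER)")
    source = [1, 2, 3, 4]
    try:
        run_load(conn, source, start_step=1, fail_after=2)
    except RuntimeError:
        pass  # the issue is assumed to be resolved before the restart
    run_load(conn, source, start_step=restart_step)
    return conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]


print(row_count_after_restart(2))  # 6: two leftover rows plus four new rows
print(row_count_after_restart(1))  # 4: table truncated again before insert
```

Restarting at the insert step leaves six rows for four source records, while
restarting at the truncate step leaves exactly four.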
To maintain data integrity in the
case of restart, the restart grain varies depending on the location in the step
hierarchy of the failed step and on the Restart setting for the step in the
Load Plan Manager.
Within the Steps Hierarchy in Load
Plan Manager, you can view and edit the Restart setting of a step in the
Restart column. The default settings for different steps in the hierarchy
support data integrity in restarts:
- Root steps are set to 'Restart from Failure' if Serial and 'Restart from Failed Children' if Parallel.
- Sub steps are set to 'Restart from Failure' if Serial and 'Restart from Failed Children' if Parallel.
- Scenario steps are set to 'Restart from Failed Step'.
The following examples highlight the
implications for each type of load plan step.
Serial steps are represented by a horizontal
icon in the Steps hierarchy in Load Plan Manager and by default have a Restart
setting of Restart from Failure. For example, if the load plan fails while
running a serial step that loads a Dimension Group through multiple serial
sub-steps, each loading an individual dimension, the restarted load plan starts
from the individual sub-step that failed. Serial sub-steps that completed
successfully are not run again.
Parallel steps are represented by a
vertical icon in the Steps hierarchy in Load Plan Manager and by default have a
Restart setting of Restart from Failed Children. In a typical run, a parallel
step with five parallel sub-steps under it would have all five sub-steps
executed in parallel, subject to free sessions being available. If two of those
five sub-steps complete and the load plan then fails, restarting the load plan
starts only the three sub-steps that failed or did not complete.
At the lowest order in any load plan
are the scenario steps. While the parent steps, whether serial or parallel, are
used to set the dependencies, the scenario steps are those which load the
tables. A scenario step in turn could have one or more sub-steps, corresponding
to the number of steps inside the package.
In the case of a scenario step
failure during execution, consider that the scenario step may have multiple
steps, all under the same session in Operator log but identified with different
step numbers: 0, 1, 2, and so on. In the case of restart, the scenario would
execute from the failed parent scenario step, re-running all sub-steps.
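The default restart behavior described above can be modeled with a small
sketch. The Step class and the scenarios_run_on_restart function are
hypothetical names used only for illustration; they are not an ODI API. The
model covers only the default Restart settings and ignores the difference in
execution order between serial and parallel steps.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    name: str
    status: str = "waiting"  # "done", "error", or "waiting"
    children: List["Step"] = field(default_factory=list)


def scenarios_run_on_restart(step: Step) -> List[str]:
    """Names of scenario steps (leaves) that execute again on restart."""
    if step.status == "done":
        return []  # completed steps are not run again
    if not step.children:
        return [step.name]  # a failed or pending scenario runs as a whole
    rerun: List[str] = []
    for child in step.children:
        rerun.extend(scenarios_run_on_restart(child))
    return rerun


# Serial step: the first dimension loaded, the second failed.
dimension_group = Step("Dimension Group", "error", [
    Step("Load Dim A", "done"),
    Step("Load Dim B", "error"),
    Step("Load Dim C", "waiting"),
])
print(scenarios_run_on_restart(dimension_group))
# ['Load Dim B', 'Load Dim C']
```

The same function applied to a parallel step with five children, two of them
done, returns the three children that failed or did not complete.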
Note:
If you use the Load Plan Generator
to generate any load plan it would automatically conform to the above standard.
If you are manually altering a generated load plan or creating a new load plan
without using Load Plan Generator, then you should ensure that you conform to
the above standard.
Use ODI Studio or ODI Console to
restart a load plan. This section describes how to restart load plans. It
includes the following sections:
- Restarting Using ODI Studio
- Restarting Using ODI Console
Follow this procedure to restart a
load plan using ODI Studio. The restart option is enabled only on the last run
for a load plan. A load plan can be restarted any number of times and each time
it progresses from the last failure.
To restart a load plan using ODI
Studio:
- In ODI Operator, navigate to the Operator log and select the last failed run for a load plan.
- Double-click the load plan run and select the Restart Option. You can also right-click the last run in the Operator log and select Restart.
Follow this procedure to restart a
load plan using ODI Console. The restart option is enabled only on the last run
for a load plan. A load plan can be restarted any number of times and each time
it progresses from the last failure.
To restart a load plan using ODI
Console:
- In ODI Console, navigate to Runtime > Sessions/Load Plan Executions and select the load plan execution that has failed.
- Click the Restart button. The Restart button is displayed only when the selected load plan is the most recent run of the load plan.
A load plan must be restarted when
it has stopped with an error. An alternate case where restart may be required
is when a load plan is not doing anything at all, for example when a load plan
is executed and nothing has changed after 30 minutes. The following checklist
can be used to assist in troubleshooting a non-responsive load plan.
1) Check the maximum number of sessions set to run against the agent. In ODI
Operator, check whether the number of sessions running is equal to the maximum.
If so, then the other sessions are waiting for the running sessions to
complete. Proceed to the next step.
2) Clean out stale sessions. Stale sessions are sessions that are incorrectly
left in a running state after an agent or repository crash. If an agent crashes
or loses its connection to the repository after it has started a session, it is
not able to update the status of the session in the repository, and such a
session becomes stale. Until the stale session is cleaned, it shows up as
running in the repository but actually is not.
Stale sessions are cleaned in multiple ways. Some examples are listed below:
- You can manually request specific agents to clean stale sessions in Operator Navigator or Topology Navigator.
- Stale sessions are cleaned when you restart an agent.
- When an agent starts any new session, it checks for and resolves stale sessions. However, if the agent has lost connection to the repository, then it cannot clean stale sessions.
3) Check if the agent is alive. To test the agent to see if it is running and
still has a connection to the repository, open it in the Topology Navigator in
ODI Studio and select the Test tab. If the agent test fails, then restart the
agent after fixing the issue.
4) Verify that the ODI Repository and the server hosting it are running and
have not experienced a failure.
5) If your load plan is in error and you have verified all of the above, then
restart the load plan.
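The checklist above can be read as a simple decision procedure. The following
is a minimal sketch of that reading. The Observation fields and the
recommended_actions function are hypothetical: they stand for facts you would
gather by hand in ODI Operator and Topology Navigator, and they do not call any
ODI API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    running_sessions: int   # sessions currently running against the agent
    max_sessions: int       # maximum sessions configured for the agent
    stale_sessions: int     # sessions shown as running that are not
    agent_test_ok: bool     # result of the agent test
    repository_up: bool     # repository and its host are running
    plan_in_error: bool     # the load plan has stopped with an error


def recommended_actions(obs: Observation) -> List[str]:
    """Walk the checklist in order and collect what to do next."""
    notes: List[str] = []
    if obs.running_sessions >= obs.max_sessions:
        notes.append("agent is at its session limit: other sessions are waiting")
    blockers: List[str] = []
    if obs.stale_sessions > 0:
        blockers.append("clean stale sessions")
    if not obs.agent_test_ok:
        blockers.append("fix the agent issue, then restart the agent")
    if not obs.repository_up:
        blockers.append("restore the ODI repository or its host server")
    if obs.plan_in_error and not blockers:
        return notes + ["restart the load plan"]
    return notes + blockers


print(recommended_actions(Observation(2, 5, 0, True, True, True)))
# ['restart the load plan']
```

In this sketch, reaching the session limit is treated as an explanation for
waiting sessions rather than as a reason to hold back a restart, which is one
possible interpretation of the first checklist item.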
This section describes alternate
ways to approach restarting failed load plans. It includes the following
sections:
- Using Mark as Complete
- Running a Scenario Standalone
In most cases the load plan restart
method described earlier in this section is the recommended approach. This
approach ensures data integrity and leaves no scope for manual error. However,
at times you may want to run a load plan step manually. For example, if a step
is inserting duplicate records which are causing failure, rerunning the step
would still insert duplicates. In such a case, you may need to manually correct
the data outside of the load plan and then skip that step when you restart the
load plan. For this kind of situation, you can use the Mark as Complete option.
When you mark a load plan step as
complete, it ensures that when the load plan is restarted, the marked step is
not executed. It is then the responsibility of the person making this setting
to ensure that the load for that step is carried out outside the load plan.
To mark a step as complete,
right-click the step and select Mark As Complete. This can be done at the
scenario step or at any step higher than that.
Marking a step complete at a higher
level in the step hierarchy would mean that none of the child steps under that
parent step would be executed upon load plan restart, even if they are
otherwise eligible. For this reason, marking a step as complete should be
treated as an advanced task and must be done only with a full understanding of
its impact. There is no single recommendation that pertains in all cases, so
the setting must be done carefully and only on a case-by-case basis.
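The effect of marking a step at a higher level can be sketched as follows.
Steps are identified here by a slash-separated path through the step hierarchy,
which is a hypothetical convention chosen for the example, as are the step
names and the skipped_on_restart function.

```python
def skipped_on_restart(step_path, marked_complete):
    """True if the step or any of its ancestors is marked as complete."""
    parts = step_path.split("/")
    # Build the path of the step itself and of every ancestor above it.
    lineage = {"/".join(parts[:i]) for i in range(1, len(parts) + 1)}
    return bool(lineage & set(marked_complete))


marked = {"Root/Dimensions"}  # a parent step marked as complete
print(skipped_on_restart("Root/Dimensions/Load Dim A", marked))  # True
print(skipped_on_restart("Root/Facts/Load Fact X", marked))      # False
```

Every child under Root/Dimensions is skipped, including children that failed or
never ran, which is why marking a parent step needs particular care.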
When you are monitoring a load plan,
you may not completely know how to fix a scenario step failure, but may wish to
use the 'mark as complete' option for the failed scenario step instead of
waiting for complete resolution. This prevents a single step failure from
blocking completion of the entire load plan, while allowing you to inform the
ETL team about the failed scenario step and work on a resolution. The ETL team
might then fix the scenario and want to run it standalone outside the load plan
to complete the load.
As in marking a step as complete,
running a scenario standalone should be treated as an advanced task and the
person running the scenario must be aware of the following:
- A scenario run outside of a load plan by itself invokes the Table Maintenance process. This could, depending on the setting, truncate the table before the load.
To understand this, consider that when a scenario is run inside a load plan,
table maintenance tasks are carried out as explicit steps (the parent step name
is either Initialize or Finalize). The scenario by itself does not call the
Table Maintenance process when run from within the load plan. This behavior is
controlled by the EXECUTION_ID variable, which is set to the load plan instance
ID. If this variable has a value greater than 0 when a scenario is run, as is
the case when a scenario is run from within a load plan with an instance ID,
the Table Maintenance process is not invoked. If this variable does not have a
value greater than 0, as is the case when a scenario is run outside the load
plan, the scenario invokes the Table Maintenance process. If you set a value
greater than 0 for EXECUTION_ID when invoking the scenario from outside a load
plan, the table maintenance steps are not called.
- A scenario step could have many variable values set, either dynamically in the case of a refresh variable or explicitly by overriding its value at that scenario step in the load plan. When running a scenario outside the load plan, all the scenario variables would have only their default values. For this reason, care should be taken to set the variables appropriately before calling a scenario from outside the load plan. You can check the variable values that are present in the load plan by looking at the Operator log, provided the log level was set to 6 when the load plan ran. The Oracle BI Applications Configuration Manager uses Oracle Diagnostic Logging. For information about managing log files and diagnostic data, see Oracle Fusion Middleware Administrator's Guide.
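The two points above, the EXECUTION_ID check and the fallback to default
variable values, can be summarized in a minimal sketch. The function names are
hypothetical, VAR_A and VAR_B are placeholder variable names, and treating an
unset EXECUTION_ID as None is an assumption made for the example.

```python
from typing import Dict, Optional


def invokes_table_maintenance(execution_id: Optional[int] = None) -> bool:
    """True when the scenario itself would call the Table Maintenance process.

    This happens whenever EXECUTION_ID does not have a value greater than 0.
    """
    return not (execution_id is not None and execution_id > 0)


def effective_variables(defaults: Dict[str, str],
                        overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Values a scenario sees: defaults, replaced by any values supplied."""
    return {**defaults, **(overrides or {})}


print(invokes_table_maintenance())        # True: run standalone, nothing set
print(invokes_table_maintenance(12345))   # False: behaves as inside a load plan

defaults = {"VAR_A": "default_a", "VAR_B": "default_b"}
print(effective_variables(defaults))                       # defaults only
print(effective_variables(defaults, {"VAR_B": "from_lp"}))  # VAR_B overridden
```

When running standalone, the values to pass as overrides are the ones recorded
in the Operator log for the original load plan run.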
This section lists some of the
Oracle BI Applications features that are related to restartability and
describes some related considerations.
If a scenario step is failing due to
bad source data, it may sometimes be desirable to enable the CKM option to load
the valid records and route the error records to a separate table. Examples of
situations where this may be appropriate are null values in columns that
require a value, or data longer than the target column allows.
Once the load completes, you could correct the erroneous data and have it
automatically picked up in a subsequent load.
Note:
Use of CKM can slow the load
considerably because every record and column could potentially be checked
before loading. For this reason, this is not an option that you want to turn on
across the entire load plan.
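The idea of loading valid records while routing error records elsewhere can be
illustrated with a short sketch. This is not how the CKM is implemented. The
split_records function, the record layout, and the two checks (null values and
column length) are assumptions chosen to mirror the examples above.

```python
def split_records(records, not_null, max_len):
    """Separate records into valid rows and error rows with reasons.

    not_null: column names that require a value.
    max_len: mapping of column name to the maximum allowed length.
    """
    valid, errors = [], []
    for rec in records:
        problems = [f"{col} is null" for col in not_null if rec.get(col) is None]
        problems += [
            f"{col} exceeds {limit} characters"
            for col, limit in max_len.items()
            if rec.get(col) is not None and len(str(rec[col])) > limit
        ]
        if problems:
            errors.append({**rec, "errors": problems})  # goes to the error table
        else:
            valid.append(rec)  # goes to the target table
    return valid, errors


rows = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": None},
    {"id": 3, "name": "A very long name"},
]
valid, errors = split_records(rows, not_null=["name"], max_len={"name": 10})
print(len(valid), len(errors))  # 1 2
```

Every record passes through every check, which reflects why this kind of
validation slows the load when it is enabled broadly.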
Consider a case where a load plan is
started and fails at a scenario step. You fix the issue and regenerate the
scenario, then restart the load plan expecting it to pick up the new
scenario, but this is not what happens. If a load plan has been started and a
scenario regenerated, the regenerated scenario code is not picked up when the
load plan is restarted. To force the regenerated scenario to be picked up, you
have two options:
- Start a new load plan run, accepting the overhead associated with restarting the load from the beginning.
- Run the regenerated scenario as stand-alone outside the load plan, marking that scenario step as complete in the load plan before restarting the load plan.
Restarting Long Running Jobs
Consider a case where you have a scenario that takes two hours to run. The
scenario fails at the insert new rows step, after loading the C$ and I$ tables.
On restart, the scenario attempts to reload the C$ again. You instead want it
to restart from the insert new rows step only.
This use is not supported. The restartability mechanism has been put in place
in such a way that restarting a load plan is all you need to do. You do not
need to clean up any data in between load plan executions, because the data is
committed to the target table only after all the Knowledge Module steps are
successful. If the load fails before complete success, no data is committed to
that specific target table as part of the failed session.
Note:
New C$ and I$ tables are created on restart and hence data to these tables
would be committed in between load plan start and restart.
On restart, new C$ and I$ tables are created, but since the previous load did
not complete, these tables from the previous session are not dropped. The load
plan generated using Load Plan Generator has a step at the end of the load plan
called Clean Stale Work Tables which takes a variable called
ETL_DTOP_STG_OLDER_THAN_DAYS whose default value is 30 days. When this step
runs, it drops any C$ and I$ tables that are older than the specified variable
value.
Note:
C$ and I$ tables are useful tables when you want to debug a failed scenario. It
might not be advisable to set the ETL_DTOP_STG_OLDER_THAN_DAYS value too small,
for example one day, as you might lose valuable information for debugging.
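The age-based cleanup performed by the Clean Stale Work Tables step can be
sketched as a simple filter. How the real step determines a table's age is not
described here, so the creation timestamps, the table names, and the
stale_work_tables function are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def stale_work_tables(tables, now, older_than_days=30):
    """Names of C$ and I$ tables created before the cutoff.

    tables: mapping of table name to creation datetime.
    older_than_days: plays the role of ETL_DTOP_STG_OLDER_THAN_DAYS.
    """
    cutoff = now - timedelta(days=older_than_days)
    return sorted(
        name for name, created in tables.items()
        if name.startswith(("C$", "I$")) and created < cutoff
    )


now = datetime(2024, 3, 1)
tables = {
    "C$_ORDERS": datetime(2024, 1, 1),   # left over from an old failed session
    "I$_ORDERS": datetime(2024, 2, 20),  # left over from a recent failure
    "W_ORDERS_F": datetime(2023, 1, 1),  # not a work table, never dropped
}
print(stale_work_tables(tables, now))                     # ['C$_ORDERS']
print(stale_work_tables(tables, now, older_than_days=1))  # ['C$_ORDERS', 'I$_ORDERS']
```

With a threshold of one day, the recent I$ table is also dropped, which removes
the evidence you might need to debug the recent failure.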