August 2020

Process Control, Instrumentation and Automation

Avoid common data preparation mistakes to improve analytics results

Reckamp, J., Seeq Corp.

Time series data preparation in process manufacturing applications presents complex challenges, such as differences in data sampling rates, inconsistent or custom units, and the need to access data in multiple systems, among other issues. Therefore, time series data is very difficult to collate and align for modeling, analytics or other approaches commonly used to create insights.

Modeling relates to a variety of techniques, including regression or clustering algorithms, along with more complex machine learning or artificial intelligence models, such as neural networks or random forest decision trees. Regardless of what type of model is chosen, all of them require data preparation to achieve reliable results.

Subject matter experts (SMEs) know they should document their methods. However, their primary focus is typically the assumptions related to the model itself, such as the algorithm, the training data set or the model’s applicability only over a certain range of input values. Often overlooked are data preparation assumptions, which are critical for deploying a model with high confidence.

Advanced analytics software enables SMEs to address data preparation challenges prior to modeling. This type of software can connect to the raw data, align it, perform calculations and enable collaboration across the organization. Most models are only as good as the training data used to create them, so modelers should focus just as much, if not more, on the preparation of the data set as they would on actual algorithm or model development.

Four leading data preparation issues are examined here, along with suggestions as to how each can be addressed using advanced analytics software.

Improper data sampling/gridding

Before running a model, the data must be aligned. In data science, this is often referred to as gridding, where the data is resampled at a frequency to select the closest or interpolated value for each process signal at that point in time. While the gridding period is often selected based on the length of the training window and the amount of computational time it may take to process a certain number of samples, it is rarely tracked back to the raw data. It is uncommon that models are tested on multiple gridding frequencies to see how the gridding selection impacts the model.

Gridding can drastically impact a model’s data set, and hence the model outcome. FIG. 1 depicts a process data signal of temperature (purple), with gridding at 2-min (green) and 5-min (red) intervals.

FIG. 1. The purple line depicts raw temperature data, while the other two lines depict the same data sampled over two different time intervals—a form of filtering or gridding.

While the purple signal, which represents all available process data, shows a distinct oscillatory pattern, the selected gridding period may completely remove any indication of oscillation, as shown with the 5-min gridding. This inherent filtering of the data may result in the removal of noise in the model, a positive result, or it may completely miss a key characteristic of the critical variable the model was designed to examine.
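The effect can be reproduced with a minimal sketch. The signal below is hypothetical (a 5-min oscillation sampled every 30 sec), not the data behind FIG. 1, and `grid_signal` and `peak_to_peak` are illustrative helper names rather than functions from any particular analytics package.

```python
import numpy as np

def grid_signal(t_raw, y_raw, period):
    """Resample (t_raw, y_raw) onto a regular grid by linear interpolation."""
    t_grid = np.arange(t_raw[0], t_raw[-1] + 1e-9, period)
    return t_grid, np.interp(t_grid, t_raw, y_raw)

def peak_to_peak(y):
    """Difference between the largest and smallest value of a signal."""
    return float(np.max(y) - np.min(y))

# Hypothetical raw signal: one sample every 0.5 min for 60 min,
# oscillating +/-2 degrees around 100 with a 5-min period.
t_raw = np.arange(0.0, 60.5, 0.5)
y_raw = 100.0 + 2.0 * np.sin(2.0 * np.pi * t_raw / 5.0)

_, y_2min = grid_signal(t_raw, y_raw, 2.0)
_, y_5min = grid_signal(t_raw, y_raw, 5.0)

for label, y in [("raw", y_raw), ("2-min", y_2min), ("5-min", y_5min)]:
    print(f"{label:>5}: peak-to-peak = {peak_to_peak(y):.2f}")
```

In this constructed case the 5-min grid lands on the same phase of every cycle, so the gridded signal is essentially flat, while the 2-min grid still shows the swing. Real data will rarely line up this neatly, but the same mechanism applies.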

If gridding introduces such issues, why not just use every available data point? In most cases, models process many data inputs, so extensive computational power would be required and may not be available. There is also a high risk of overfitting. Overfitting occurs when the model begins to describe the noise present in the system rather than the relationships among variables. An overfit model often shows a distinct difference in correlation and accuracy between the training and testing data sets, resulting in a poor model.

Furthermore, each data input is often not sampled at the same frequency or time. Therefore, there will always be some signal that is being filtered, unless the most frequent signal is selected as the gridding period.

In this case, all process variability will be captured in the data set, but the validity of the interpolation type must be questioned, especially if the sampling frequencies of the model inputs differ significantly. By default, data historians tend to interpolate between stored data points, typically using a linear slope between samples. When many interpolated data points are being used in the model, one should question whether that linear slope accurately represents the data, or whether some other function, like a filtering algorithm, would better represent the data.
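To make the interpolation question concrete, the sketch below grids a sparsely sampled signal two ways: with a linear slope between samples, and with a step (hold the last value) rule. The sample values and both function names are hypothetical; which rule is appropriate depends on the measurement, for example a lab result or a setpoint is often better held than sloped.

```python
import numpy as np

def grid_linear(t_raw, y_raw, t_grid):
    """Linear interpolation between stored samples."""
    return np.interp(t_grid, t_raw, y_raw)

def grid_step(t_raw, y_raw, t_grid):
    """Hold the most recent stored sample until the next one arrives."""
    idx = np.searchsorted(t_raw, t_grid, side="right") - 1
    idx = np.clip(idx, 0, len(t_raw) - 1)
    return np.asarray(y_raw)[idx]

# Hypothetical slow signal: three stored samples, one hour apart (minutes).
t_raw = np.array([0.0, 60.0, 120.0])
y_raw = np.array([10.0, 20.0, 10.0])

# Grid at the 30-min period of a faster input.
t_grid = np.arange(0.0, 121.0, 30.0)

linear = grid_linear(t_raw, y_raw, t_grid)
step = grid_step(t_raw, y_raw, t_grid)
print("linear:", linear)
print("step:  ", step)
```

Half of the gridded points here are interpolated values that were never measured, and the two rules disagree on every one of them.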

As an added layer of complexity, many historians utilize compression to minimize the amount of stored data. With compression, the data is stored at an inconsistent rate, with a new data point created only when a specific deviation in value is observed, or if a maximum duration has been exceeded. This results in inconsistent data sampling frequencies. If original data timestamps are used as the gridding for the model, then the model is effectively weighted to favor periods where data variability was greater due to the presence of more sample points.
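The weighting effect can be illustrated with a small, hypothetical compressed signal: two stored points cover 100 min of steady operation, while a 10-min upset generates five more. A plain average over the stored points is compared with a time-weighted average; `time_weighted_mean` is an illustrative helper that assumes linear interpolation between samples.

```python
import numpy as np

def time_weighted_mean(t, y):
    """Mean of a linearly interpolated signal (trapezoidal rule)."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    area = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t))
    return float(area / (t[-1] - t[0]))

# Hypothetical compressed signal (time in minutes).
# Steady at 50 for 100 min, then a short upset that is stored densely.
t = np.array([0.0, 100.0, 102.0, 104.0, 106.0, 108.0, 110.0])
y = np.array([50.0, 50.0, 60.0, 70.0, 70.0, 60.0, 50.0])

sample_mean = float(np.mean(y))          # every stored point counts equally
weighted_mean = time_weighted_mean(t, y) # every minute counts equally

print(f"mean of stored samples: {sample_mean:.2f}")
print(f"time-weighted mean:     {weighted_mean:.2f}")
```

The upset covers less than 10% of the time span but supplies most of the stored points, so the unweighted average is pulled well toward the upset values.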

The takeaway is that there is no precise workaround for selecting the perfect gridding period. Therefore, one should always document the assumptions made during the gridding process and understand the differences between the gridded and actual process data, preferably by visually comparing the gridded data set against the raw data—a task made easier through the use of advanced analytics software.

Checking for differences with multiple gridding periods or including additional information in the model (e.g., moving window standard deviation to capture oscillations) are ways to verify that gridding did not adversely impact model results. Practical application of these and other verification methods requires the SME to access and inspect the data source. If an SME obtains a data export with aligned timestamps and a consistent sampling period, the data has most likely already been gridded and is not the raw data, so it should be handled with caution.
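A moving window standard deviation can be sketched as follows. The signal is hypothetical: steady for 30 min, then oscillating with a 5-min period, and gridded at 5 min so that the gridded values alone hide the oscillation. `rolling_std_at` is an illustrative helper, and the trailing 5-min window is an assumption chosen to match the gridding period.

```python
import numpy as np

def rolling_std_at(t_raw, y_raw, t_grid, window):
    """Sample std. dev. of raw points in the trailing window (t - window, t]."""
    out = np.full(len(t_grid), np.nan)
    for i, t in enumerate(t_grid):
        mask = (t_raw > t - window) & (t_raw <= t)
        if mask.sum() >= 2:
            out[i] = y_raw[mask].std(ddof=1)
    return out

# Hypothetical raw signal, one sample every 0.5 min (time in minutes).
t_raw = np.arange(0.0, 60.5, 0.5)
oscillation = 2.0 * np.sin(2.0 * np.pi * t_raw / 5.0)
y_raw = np.where(t_raw < 30.0, 100.0, 100.0 + oscillation)

# 5-min grid: the gridded value and the extra variability feature.
t_grid = np.arange(0.0, 60.1, 5.0)
y_grid = np.interp(t_grid, t_raw, y_raw)
std_grid = rolling_std_at(t_raw, y_raw, t_grid, window=5.0)

for t, y, s in zip(t_grid, y_grid, std_grid):
    print(f"t={t:5.1f}  gridded={y:7.2f}  rolling std={s:5.2f}")
```

The gridded column stays at about 100 throughout, while the rolling standard deviation rises once the oscillation starts, giving the model a signal it would otherwise never see.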

Advanced analytics software can be used by SMEs to resolve this issue by connecting directly to the raw data sources and providing a visual representation of how accurately various gridding periods capture the data set. With this type of software, SMEs have direct access to the data to test the impacts of gridding frequencies. In addition, filtering algorithms can be applied to the data set to better capture non-linear or transient time periods, or to remove outliers due to faulty sensors.

Aligning data by time instead of flow

A common data preparation mistake is aggregating data by timestamp. While it may make sense to compare multiple signals based on an equivalent timestamp, there is an intrinsic assumption built into such a comparison that the process fluctuates simultaneously across all sensors, which is usually not the case. Instead, the best way to align data is usually by material flow.

For example, if one wants to know which process variables impact the quality of a product, they will need to know when the slug of material with a quality measurement passed by each sensor upstream in the process. In a short, rapid process, the difference between aggregating by time and by material flow may be negligible. Conversely, a long process may have hours of delay between measurements made by upstream sensors and those by a downstream quality analyzer.

A pipeline is an extreme example, as it can be thousands of miles long with transported media moving at relatively slow velocities, with multiple sensors installed along its length to measure temperature, pressure and other parameters. The quality of the material may be measured at the end of the pipeline, which could have taken days to arrive from the source.

To know what additives, temperatures or other variables influenced the quality of that product, the process media should be tracked back through the pipe to determine the sensor values when that media passed each additive injection point or sensor—for example, the additive flow rate that was present when this material passed a flowmeter, or the temperature of the material when it passed a temperature probe, rather than simply using measurements made at the end timestamp when quality was analyzed.

In refineries and petrochemical plants, most processes experience some delays between quality and upstream measurements, though typically not to the extreme of a pipeline. Therefore, the data requires some shifting of time to improve the understanding and applicability of the underlying model.

The ideal time shift for data is often what is known as the residence time between sensors, which can be calculated by the volume of the system between each sensor divided by the flowrate of the system. As the flowrate fluctuates throughout the process, this time shift is not a constant value, but a variable fluctuating based on how fast the system is processing material.
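A variable time shift of this kind can be sketched under a plug-flow assumption: material measured downstream at a given time passed the upstream sensor at the moment when the volume pumped since then equals the volume between the two sensors. The volume, flowrates, temperature trend and the helper name `upstream_time` below are all hypothetical, and the calculation requires the flowrate to stay positive.

```python
import numpy as np

def upstream_time(t, flow, volume, t_quality):
    """Time at which material seen downstream at t_quality passed upstream.

    Assumes plug flow and flow > 0. Returns NaN where the flow record does
    not reach back far enough to cover the volume between the sensors.
    """
    t = np.asarray(t, dtype=float)
    flow = np.asarray(flow, dtype=float)
    # Cumulative volume pumped since t[0] (trapezoidal integration).
    cum = np.concatenate(([0.0], np.cumsum(0.5 * (flow[1:] + flow[:-1]) * np.diff(t))))
    target = np.interp(t_quality, t, cum) - volume
    t_up = np.interp(target, cum, t)  # invert the cumulative-volume curve
    return np.where(target < 0.0, np.nan, t_up)

# Hypothetical data: time in min, flow in m3/min, 300 m3 between sensors.
t = np.arange(0.0, 121.0, 1.0)
flow = np.where(t < 60.0, 10.0, 20.0)   # throughput doubles at t = 60
upstream_temp = 150.0 + 0.1 * t         # upstream sensor trend

t_quality = np.array([50.0, 90.0])      # downstream quality sample times
t_up = upstream_time(t, flow, 300.0, t_quality)
residence = t_quality - t_up
temp_at_passage = np.interp(t_up, t, upstream_temp)

print("residence time (min):", residence)
print("upstream temperature seen by each sample:", temp_at_passage)
```

With a steady flowrate this reduces to volume divided by flowrate (30 min at 10 m3/min here). Once the throughput doubles, the same vessel volume corresponds to a 15-min shift, which a single fixed offset would miss.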

Advanced analytics software can be used by SMEs to shift timestamps and align the data based on the material flow through the process (FIG. 2). The ability to delay or shift the data by a dynamic time value, such as a calculated residence time, enables the model to accurately capture the impact of changes in flowrate.

FIG. 2. Advanced analytics software can be used to align downstream quality measurements with associated upstream data points based on material flow residence time.

Missing the trajectory

Batch processes avoid many of the time shift requirements described in the previous section, but they come with challenges of their own. During batch processes, the process is often in a transient state. While it may be possible to make some online quality measurements, in most cases, offline quality measurements are required, with samples taken periodically throughout the batch. These batch samples are generally discrete measurements and only a few samples may be taken per batch, so accurate modeling either requires interpolation between data points or significant limitation of the data set.

Batch processes introduce additional complication if the data point at a timestamp is not the important value, but rather the entire trajectory of the batch leading up to that data point. FIG. 3 illustrates this situation by overlaying two batches on the same time scale.

FIG. 3. Using data for a model only at the point indicated by the cursor would miss the trajectory taken to reach that point, often a critical item of information.

If the quality sample measurement was taken at the time of the cursor (approximately 10 hr and 19 min into the batch), both batches would show the same temperature at that moment: 93°F (33.9°C). If that were the only data gridded for the model, it would entirely miss the trajectory of temperatures that each batch followed to reach that point.

In the case of a reaction, the differences in temperature profiles during the process leading up to that sampling point would likely impact quality and/or yield, even though the model data set would assume identical input values.

Therefore, it is critical to include some representation of the trajectory when modeling a batch process, as the quality or yield at any point within the batch is often a function of all the data up to that sample point. A wide variety of options exist for incorporating these batch trajectories, ranging from simple statistics (e.g., totaling flowrates to get the total amount of material added to a reactor) to dynamic modeling methods.

In addition, the use of golden profiles or golden batches can help restrict the model to batches that fit a desired trajectory. This is a similar approach to restricting the input parameters of a model to the range available in the training data set.

An SME can utilize advanced analytics software to first contextualize the data and provide information about the batch, operation or phase of interest. This enables the model to focus just on the data relevant to that period of interest, and to calculate key parameters based on those batch time periods.

Dynamic models can be used to capture the batch trajectory by basing the models on calculated parameters, such as the amount of time a batch has been running, or dynamic aggregations of key statistics, such as maximum temperature in the batch or totalized reagent fed into a reactor.
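The calculated parameters mentioned above can be sketched for two hypothetical batches that reach the same temperature at the sample time by different routes. The profiles, feed rates and the helper name `trajectory_features` are illustrative assumptions, not the batches shown in FIG. 3.

```python
import numpy as np

def trajectory_features(t, temp, feed_rate, t_sample):
    """Summarize a batch from its start up to the quality sample time."""
    mask = t <= t_sample
    tb, temp_b, feed_b = t[mask], temp[mask], feed_rate[mask]
    # Totalized feed: trapezoidal integration of the feed rate.
    total_feed = np.sum(0.5 * (feed_b[1:] + feed_b[:-1]) * np.diff(tb))
    return {
        "elapsed": float(tb[-1] - tb[0]),
        "temp_at_sample": float(temp_b[-1]),
        "max_temp": float(temp_b.max()),
        "total_feed": float(total_feed),
    }

# Hypothetical batches: time in hr, temperature in deg F, feed in kg/hr.
t = np.arange(0.0, 11.0, 1.0)
temp_a = np.linspace(70.0, 93.0, 11)                      # steady ramp
temp_b = np.concatenate((np.linspace(70.0, 110.0, 6),     # overshoot ...
                         np.linspace(110.0, 93.0, 6)[1:]))  # ... then cool
feed_a = np.full(11, 2.0)
feed_b = np.full(11, 3.0)

features_a = trajectory_features(t, temp_a, feed_a, t_sample=10.0)
features_b = trajectory_features(t, temp_b, feed_b, t_sample=10.0)
print("batch A:", features_a)
print("batch B:", features_b)
```

A model fed only the temperature at the sample time would treat these two batches as identical, while the maximum temperature and totalized feed columns distinguish them.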

Using raw data without consulting the SME

If the person creating the model is not the SME, the modeler should collaborate with the SME throughout to ensure quick and accurate insights, because process data sets are complex. In addition to all the previous points, there is a host of other nuances familiar only to SMEs.

The modeler often respects the data as the law and attempts to avoid data manipulation. However, there are times when calculations are required to make the models match physical reality. SMEs likely deal with these and other issues on a regular basis and can help sort through the necessary data preparation steps.

Consulting with the SME to fully understand the process can also result in more productive outcomes. For example, someone with limited understanding of a process might create a model showing an inverse relationship between temperature and pressure in a gaseous reaction. This is most likely not an insight, but instead an obvious known fact to an SME familiar with the process.

SMEs provide a wealth of knowledge about the process, including known correlations among variables that may result in multicollinearity, variables that should be combined due to the synergistic effects in the process or constraints on process changes. For example, a model showing that best quality would be achieved by increasing line speed and decreasing tension is pointless if those two variables are intrinsically linked, or if it is not feasible to adjust these variables.
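A simple screen for linked inputs is a pairwise correlation check before modeling, sketched below on synthetic data in which tension is constructed to follow line speed. The variable names, the 0.9 threshold and the helper `collinear_pairs` are assumptions for illustration. A high correlation is only a prompt to ask the SME whether the link is physical, not a substitute for that conversation.

```python
import numpy as np

def collinear_pairs(data, threshold=0.9):
    """Return (name_i, name_j, r) for input pairs with |r| above threshold."""
    names = list(data)
    corr = np.corrcoef(np.vstack([data[n] for n in names]))
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic inputs: tension is built to track line speed; oven temperature
# is independent of both.
rng = np.random.default_rng(0)
line_speed = rng.uniform(100.0, 200.0, 200)
tension = 0.5 * line_speed + rng.normal(0.0, 1.0, 200)
oven_temp = rng.normal(180.0, 5.0, 200)

inputs = {"line_speed": line_speed, "tension": tension, "oven_temp": oven_temp}
flagged = collinear_pairs(inputs)
for a, b, r in flagged:
    print(f"{a} and {b} are strongly correlated (r = {r:.3f})")
```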

Facilitating collaboration

Advanced analytics software enables documentation and knowledge capture for collaboration between SMEs and data scientists. This type of software not only provides data scientists with access to the raw data, but also gives them access to information the SME may add to provide context. Examples are which data or batches are abnormal, possible first principles equations that may apply to the model, variable constraints or clarification on the current state of process knowledge.

In addition, advanced analytics software enables near real-time deployment of the model results directly back to the SME to rapidly capture value. The SME can provide feedback on potential modifications of the model and note aspects of processing that are not accurately captured. Meanwhile, the data scientist can continue to adapt and optimize the model based on the most recent and relevant process data, enabling effective collaboration and knowledge capture of the assumptions during the data preparation and model development process.

Data preparation is an important but often overlooked process modeling step. Each process is different, so the modeler should think carefully about the various data preparation assumptions. Determining appropriate gridding of the data, understanding the best method for data alignment, incorporating the trajectory of variables for batch or process run models and collaborating extensively with the SME are all vital steps in building the best process model for process optimization and improvements.
