Parameter Analysis Tools¶

The main use of a modeling tool like Reveal Chromatography is to provide easy and fast ways to explore the space of possibilities, much faster than if only relying on experiments. This document presents two tools to do that.

The Parameter Explorer is designed to explore the impact of one or more parameters on predicted chromatograms and performances. That can be valuable both for model calibration and for process optimization/characterization.

The second tool, the Parameter Optimizer, is more focused on the model calibration part of the workflow: it is specifically designed to find the set of parameter values that best matches experimental chromatograms.

Exploring the parameter space: Simulation Groups¶

Reveal Chromatography can be used to create multiple simulations that modify one or more parameters at once, to study their impact on performances (yield, purities, ...) or on chromatograms during model calibration. In the software, these simulations can be created, run and analyzed together, and are therefore called a Simulation Group.

A simulation group is created around an existing simulation, called the “center point simulation”, and its simulations only differ from that simulation by the set of parameters being explored. For each parameter explored, a set of values to scan is defined, and the group is made of one simulation per combination of scanned parameter values.
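
The "one simulation per combination" rule can be sketched in a few lines of Python. This is only an illustration of the idea, not Reveal Chromatography's internal code: the function name `build_grid`, the dictionary layout and all parameter values below are made up for the example.

```python
from itertools import product


def build_grid(center_point, scans):
    """Return one parameter set per combination of scanned values.

    center_point: dict of parameter name -> value (hypothetical layout).
    scans: dict of scanned parameter name -> list of values to test.
    """
    names = list(scans)
    grid = []
    for combo in product(*(scans[name] for name in names)):
        simulation = dict(center_point)       # start from the center point
        simulation.update(zip(names, combo))  # override scanned parameters only
        grid.append(simulation)
    return grid


# Made-up values, for illustration only.
center_point = {"sma_nu": 5.0, "sma_ka": 1.0, "sma_sigma": 10.0}
scans = {"sma_nu": [4.0, 5.0, 6.0], "sma_ka": [0.5, 1.0]}
grid = build_grid(center_point, scans)
print(len(grid))  # 3 values x 2 values = 6 simulations
```

Parameters that are not scanned (here `sma_sigma`) keep the center point value in every simulation of the group.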

There are two types of simulation groups: simulation grids, where scanned parameters take values sampled regularly inside some range, and Monte-Carlo explorations, where parameters take random values within a certain range. The first type of exploration is useful to build a systematic scan of parameter impacts, for example during model calibration or during process optimization. The second type of exploration is useful to model random processes that can happen during operation, and is often done during process characterization. A random exploration, with a fixed number of simulations, may also be useful when exploring many parameters/dimensions.

Creating a Simulation Grid¶

Building simulation grids allows modelers to explore a portion of the parameter space systematically and regularly. That can be useful during model calibration to find the best combination of model parameters to match experimental results (to that effect, see also Parameter optimizers). Grids can also be useful during process optimization to see what combination of operating parameters would lead to the best yield/purity.

To create a new simulation grid, go to Tools > Parameter explorer:

A new window will open to allow users to select:

• The name of the simulation grid.
• The type of the simulation group should be left as Multi-Param Grid.
• The “Source Simulation”, which is the simulation around which to build the grid.
• One or more parameters for Reveal Chromatography to scan. To begin, click the New parameter scan button at the bottom of the open window (and repeat this procedure as many times as the number of parameters to scan):

Then click the white space under the heading Parameter name to open a drop-down menu of available parameters that users can explore in the simulation grid:

In the drop-down menu, some parameters are listed with a number in brackets at the end of their name. That number identifies the product component the parameter applies to (e.g., acidic 1, acidic 2, native). The number 0 corresponds to the cation component:

Once selected, the grid configuration table will automatically display the value of the parameter as set in the center point simulation in the Center value column (in this case 5.0).

To specify the range of values for Reveal Chromatography to scan, double-click the cells under the Low and High headings. Users can adjust the Num values field to specify how many values are to be generated between Low and High and the Spacing field controls whether the spacing between these values is linear or logarithmic:
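
To make the effect of the Spacing field concrete, here is a minimal Python sketch of how regularly spaced values could be generated between Low and High. It assumes both end points are included in the scan, which is an assumption of this example rather than a documented behavior; `scan_values` is a hypothetical helper, not a Reveal Chromatography function.

```python
import math


def scan_values(low, high, num_values, spacing="linear"):
    """Return num_values points between low and high (end points included)."""
    if num_values < 2:
        raise ValueError("at least two values are needed to span a range")
    if spacing == "linear":
        step = (high - low) / (num_values - 1)
        return [low + i * step for i in range(num_values)]
    if spacing == "log":
        if low <= 0 or high <= 0:
            raise ValueError("logarithmic spacing requires positive bounds")
        log_low, log_high = math.log10(low), math.log10(high)
        step = (log_high - log_low) / (num_values - 1)
        return [10 ** (log_low + i * step) for i in range(num_values)]
    raise ValueError(f"unknown spacing: {spacing!r}")


linear = scan_values(4.0, 6.0, 5)               # evenly spaced values
logarithmic = scan_values(1.0, 100.0, 3, "log")  # evenly spaced exponents
```

Logarithmic spacing is usually the better choice when the range spans several orders of magnitude, since it places as many points per decade at the low end as at the high end.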

To scan more than one parameter, for example to explore the impact of changing both sma_nu and sma_ka, users can click the New Parameter Scan button again, and select more parameters. Users can also remove a parameter from the list by right-clicking on its row number:

Press OK when finished configuring the grid to create it. The newly created grid will automatically open in the central pane. Users can also find it in the Study Data browser under Analysis Tools > Simulation grids.

Creating a random exploration (PRO ONLY)¶

Instead of a regular grid exploration, users may want to explore a parameter space randomly, either because the number of parameters to explore is too large for a regular grid, or to do process characterization, and build a Monte-Carlo simulation of the possible outcomes of a process, assuming some uncertainties on operating parameters.

To create a Monte-Carlo exploration, the process is very similar to that of creating a simulation grid. Go to Tools > Parameter explorer:

A new window will open to allow users to select:

• The name of the simulation group.
• The type of the simulation group, which should be set to Monte-Carlo exploration.
• The size of the group, that is the number of simulations that should be generated to be part of the exploration.
• The “Source Simulation”, which is the simulation around which to build the exploration.
• One or more parameters for Reveal Chromatography to scan. To begin, click the New parameter scan button at the bottom of the open window (and repeat this procedure as many times as the number of parameters to scan).

For each parameter scanned, users must specify:

• the name of the parameter to scan,
• the distribution to use to generate random values from (more details below),
• the parameters of this distribution.

The two random distributions currently available are Uniform and Gaussian. Users should pick Uniform to specify that all values within the range defined by the low and high parameters are equally probable. Users should pick Gaussian if parameter values are not all equally probable, and rather follow a “bell-curve” around a central value (the most probable value) with a certain width (which controls how probable it is that the parameter will differ from the mean value).

For example, during process characterization, a user wanting to model the manufacturing process may decide to model the impact of the wash step pH using a Gaussian distribution because that is a well controlled parameter and most values are close to some target value, but model the impact of the load volume as a Uniform distribution between two values because of some equipment constraints.

When the Uniform distribution is selected, the two distribution parameters are the lowest and highest allowed values, respectively. When Gaussian is selected, the two distribution parameters are the mean (target value) and the standard deviation (from that mean), respectively.
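
The sampling described above can be sketched with Python's standard `random` module. The parameter names (`wash_ph`, `load_volume`), their values and the `(distribution, parameter 1, parameter 2)` tuple layout are hypothetical and chosen to mirror the wash pH / load volume example; this is not how Reveal Chromatography stores its configuration.

```python
import random


def sample_parameter(rng, distribution, param1, param2):
    """Draw one random value for a scanned parameter."""
    if distribution == "Uniform":
        return rng.uniform(param1, param2)  # param1 = low, param2 = high
    if distribution == "Gaussian":
        return rng.gauss(param1, param2)    # param1 = mean, param2 = std dev
    raise ValueError(f"unknown distribution: {distribution!r}")


def build_monte_carlo_group(scans, size, seed=0):
    """Return `size` random parameter sets, one per simulation."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return [
        {name: sample_parameter(rng, *spec) for name, spec in scans.items()}
        for _ in range(size)
    ]


# Made-up values, for illustration only.
scans = {
    "wash_ph": ("Gaussian", 7.0, 0.05),
    "load_volume": ("Uniform", 8.0, 12.0),
}
group = build_monte_carlo_group(scans, size=200)
```

Unlike a grid, the number of simulations here is set directly by `size` and does not grow with the number of scanned parameters.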

Press OK when finished configuring the group to create it. The newly created group will automatically open in the central pane. Users can also find it in the Study Data browser under Analysis Tools > Monte-Carlo explorations.

Running the simulation group¶

Once open in the central pane, users can view the group’s name, type, the number of simulations it contains, and the name of the simulation it was built from. Below this information is a table displaying parameter values to be explored. The table will eventually display performance data including yield, pool concentration, pool volume, and component purities once it has run.

To run the CADET solver on all simulations in the group, click the Run Simulation Group button, or right-click on the simulation group in the study data browser and select Run all simulations. As the simulations run, the group’s performance data table will begin to populate:

Note

If a parameter value being tested is unreasonable, for example leading to a chromatogram where the stop collection criterion is never reached, the table will not update that portion of the grid, and will continue to display a missing value (“nan”) in the corresponding row.

Once the group has fully run, the performance part of the table will be filled, and the status will go from Running... to Finished running.

Note

To explore the data table, users may sort it by any of a number of values using the drop-down menu above the grid:

More exploration capabilities are available in the data analyzer (including filtering and plotting), available using the Analyze Data button. More on this tool in Analyzing Simulation Group results (PRO ONLY). For more custom explorations, the group data can also be exported as a .csv file by selecting the Export Data to CSV button located directly below the table.
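
As an example of a custom exploration outside the application, an exported file could be loaded with Python's standard `csv` module. The column names and values below are invented for the sketch (the actual header of the exported file may differ), and the text is read from an in-memory string rather than from a real export.

```python
import csv
import io
import math

# Stand-in for the content of an exported file; columns are hypothetical.
csv_text = """sma_nu,step_yield,pool_concentration
4.0,93.8,1.92
5.0,94.6,2.05
6.0,91.03,1.71
7.0,nan,nan
"""


def load_group_data(handle):
    """Read every row of the CSV and convert all fields to floats."""
    return [
        {key: float(value) for key, value in raw.items()}
        for raw in csv.DictReader(handle)
    ]


rows = load_group_data(io.StringIO(csv_text))

# Drop simulations whose performances are missing ("nan") before sorting.
valid = [row for row in rows if not math.isnan(row["step_yield"])]
by_yield = sorted(valid, key=lambda row: row["step_yield"])
```

With a real export, `io.StringIO(csv_text)` would be replaced by an `open(...)` call on the exported file.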

Plotting the grid’s simulations¶

To visualize the effect of the scanned parameter(s), users can plot chromatograms from all the simulations of the grid by right-clicking on the simulation grid in the study data browser (under Analysis Tools) and selecting Plot all simulations:

This will produce a plot of all the simulations together with the experimental data the center point simulation was built from, if any:

If a specific simulation is needed rather than chromatograms from all simulations in the grid, users can right-click on the desired simulation in the Study Data browser, and select Plot Simulation.

Analyzing Simulation Group results (PRO ONLY)¶

If the simulation group was done to explore the parameter space rather than for model calibration, instead of visualizing chromatograms, users may want to explore/summarize/plot the performance data computed in the group. For example, one may want to build a response plot to see how, say, the purity changes with the scanned parameter(s), visualize the distribution of yields over an entire Monte-Carlo exploration, or compute some statistics (for example its 0.5% and 99.5% percentiles, to see what the predicted 99% range of values is). Or maybe one needs to filter the data to find out what set of conditions leads to a drop in yield.

All this, and more, can be done directly inside the application, using the new Data Analyzer. The tool is available as a button below the data table in the Simulation Group central pane view:

The Data Analyzer has two parts, each in a different tab. The first tab contains exploration tools, while the second tab contains plotting tools.

Data exploration¶

In the first tab, the group data can be visualized (top half), and each column is summarized in the bottom half table. For each column (explored parameters and simulation performances), their min, max, mean, standard deviation, and certain percentiles are displayed:
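
For readers who want to reproduce this kind of summary on exported data, here is a minimal Python sketch. The percentile uses linear interpolation between sorted values and the standard deviation is the sample standard deviation; both are choices made for this example, and the Data Analyzer may use different conventions. The yield values are made up.

```python
import statistics


def percentile(values, q):
    """Percentile q (0-100) using linear interpolation between sorted values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("cannot compute a percentile of an empty column")
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def summarize(column):
    """Summary statistics for one column of the group data."""
    return {
        "min": min(column),
        "max": max(column),
        "mean": statistics.mean(column),
        "std": statistics.stdev(column),  # sample standard deviation
        "p0.5": percentile(column, 0.5),
        "p99.5": percentile(column, 99.5),
    }


step_yield = [91.0, 93.0, 94.0, 94.0, 95.0]  # made-up yields, in percent
summary = summarize(step_yield)
```

The interval between the 0.5% and 99.5% percentiles contains 99% of the values, which is the "99% range" mentioned earlier.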

Note

Note that the separator between the top part (Data) and the bottom part (statistics) can be moved up and down as needed.

Note

The list (and order) of statistics being computed can be changed using the Show summary controls checkbox at the bottom. The summary statistics table will automatically update.

Additionally, the data can be sorted along any column and filtered in a custom way. For example, the data above shows a minimum of 91.03%. One may want to only display the part of the data where the yield (called step_yield) is, say, below 91.5%, to see what wash pH and load volume can lead to these lower yields. That can be done using the filter, by typing step_yield < 91.5 in the filter box:

As you type the filter, the data table will update to only show the simulations that satisfy the filter criteria. Simultaneously, the statistics will update to be recomputed on the remaining (filtered) simulations. Inspecting the remaining simulations, it seems like their common trait is a very small load volume. Plotting, described in the next section, may confirm that observation.

Note

Note that filters can be combined using the and and or keywords, and parentheses if needed. For example, step_yield < 91.5 or purity_acidic_2 > 9.904 is a valid filter:
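
The meaning of such a combined filter can be spelled out in plain Python. The rows below are invented, and `wash_ph` and `load_volume` are hypothetical column names; the sketch only illustrates which simulations the example filter would keep.

```python
# Made-up group data, for illustration only.
rows = [
    {"wash_ph": 7.02, "load_volume": 8.1, "step_yield": 91.03, "purity_acidic_2": 9.80},
    {"wash_ph": 6.97, "load_volume": 11.4, "step_yield": 94.60, "purity_acidic_2": 9.91},
    {"wash_ph": 7.05, "load_volume": 10.2, "step_yield": 94.10, "purity_acidic_2": 9.85},
]


def keep(row):
    """Equivalent of: step_yield < 91.5 or purity_acidic_2 > 9.904"""
    return row["step_yield"] < 91.5 or row["purity_acidic_2"] > 9.904


filtered = [row for row in rows if keep(row)]
print(len(filtered))  # the first row passes on yield, the second on purity
```

Replacing `or` with `and` would keep only the rows that satisfy both conditions, which in this made-up dataset is none of them.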

Data plotting¶

The Data Analyzer currently includes some plotting capabilities, set to be expanded in future versions. It is currently possible to visualize any column(s) using the following types of plot:

• histogram
• line plot
• scatter plot
• heatmap (for regular grids only)

When switching to the Plotting Tools tab of the Data Analyzer, one can see that it is made of two parts: a control table that lists the plot content and a plotting dashboard below that can contain any number of plots side by side.

Histograms¶

Histograms allow users to visualize the distribution of data, whether scanned parameters to make sure they were sampled as expected (uniformly or following a “bell-curve”) or performance parameters, for example during a process characterization, to see how purities or yields spread.

Let’s do exactly that with the Monte-Carlo exploration data generated in Creating a random exploration (PRO ONLY) and described in the previous section. Click the button at the bottom to create a new plot and select the histogram type:

Then select the column of data to display the distribution of:

Users can optionally customize the future plot by changing the x-axis title or change the styling (color, text size, number of histogram bins, ...) of the plot using the “Plot Style” tab:

After clicking OK, the resulting plot appears in the plotting dashboard:

A corresponding entry appears in the plot control table above, allowing users to edit the plot’s title, axis titles and hide/show the plot in the dashboard.

What we learn from this histogram is that even though our parameter ranges lead to a yield most likely around 94% or above, the distribution is long tailed toward lower values, and therefore some portion of the parameter space can lead to values in the 91% range.

What does the purity look like? Let’s make another histogram for it (selecting pink for the plot color this time):

Other plot types¶

Now that we know the ranges and probabilities of yields and purities, let’s explore the impact of each parameter on these performances. If the values were sorted (and regular), a line plot could make sense, but for now let’s add to our dashboard a couple of scatter plots containing the response of yield to both scanned parameters in the same dataset:

These new plots were added using the same strategy as for histograms, but selecting Scatter plot as the plot type. In this case, a column has to be selected for both the x and y axes, and the styling tab allows users to control marker type, color, and size, among other things.

Note

Note that in the plot list, the second column allows users to hide/show plots, so in the screenshot above, we hid the purity histogram.

These response plots confirm what we had guessed when filtering the data: the load volume has a stronger impact on the step yield than the pH of the wash step.

Note

If a plot is not useful anymore, it can be deleted by right-clicking on its number in the plot list table, and selecting Delete item.

Parameter optimizers¶

The Parameter Optimizer is a powerful tool that is used to find the optimal values of a set of parameters (binding model, transport model and/or operational parameters) to match one or more experiments. It is designed to simplify and speed up model calibration.

Creating an optimization¶

Reveal Chromatography’s Parameter Optimizer finds the best set of parameter values by minimizing a “cost function” which measures the alignment between the simulated chromatogram predicted by the chosen parameter values and one or more experimental chromatograms.

Within the Parameter Optimizer, there are currently two types of optimizers that can be created. Both types of optimizers implement a grid search-based algorithm, with a single step for the Grid-search optimizer and with multiple steps for the Self-refining binding model optimizer (Experimental).

The idea is to define a set of parameters to scan, create a grid of simulations that test these parameters, run all simulations in that grid, and compute the “cost” for each simulation (meaning the distance/difference between the resulting chromatogram and the experimental chromatograms). This workflow can be done manually (and subjectively) when using the Parameter Explorer tool (see Exploring the parameter space: Simulation Groups). The parameter optimizer automates that workflow.
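
The workflow just described is a plain grid search, which can be sketched as follows. The cost function here is a toy stand-in that simply measures the distance to a made-up "true" parameter set; in the application, the cost comes from comparing simulated and experimental chromatograms. All names and values are hypothetical.

```python
from itertools import product


def grid_search(cost_function, scans):
    """Evaluate the cost at every grid point and return the best one."""
    names = list(scans)
    best_params, best_cost = None, float("inf")
    for combo in product(*(scans[name] for name in names)):
        params = dict(zip(names, combo))
        cost = cost_function(params)  # stands in for "run simulation + compare"
        if cost < best_cost:
            best_params, best_cost = params, cost
    return best_params, best_cost


def toy_cost(params):
    """Toy cost: distance to a made-up optimum at sma_nu=5.0, sma_ka=1.0."""
    return abs(params["sma_nu"] - 5.0) + abs(params["sma_ka"] - 1.0)


scans = {"sma_nu": [4.0, 4.5, 5.0, 5.5, 6.0], "sma_ka": [0.5, 1.0, 1.5]}
best_params, best_cost = grid_search(toy_cost, scans)
```

The result can only be as good as the grid: if the true optimum lies between grid points, the search returns the closest point that was tested.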

Despite its simplicity, the approach has two major advantages over smarter optimizers that are under development:

1. Parallelizable: A grid search approach is automatically parallelizable, meaning that simulations in the grid are independent from each other and can be run at the same time, which speeds up runs on modern multi-core computers and clusters.
2. Reliable: A grid search always completes and returns the best point of the grid, no matter what set of parameters is being optimized, unlike other algorithms, which might rely on assumptions about the cost function’s smoothness, regularity, etc.
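
The independence of grid points can be illustrated with Python's `concurrent.futures`. This sketch uses threads and a toy cost function so that it runs anywhere; it is not how Reveal Chromatography dispatches CADET jobs, and real CPU-bound simulations would normally run in separate processes or on separate cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_cost(params):
    """Toy stand-in for running one simulation and computing its cost."""
    return abs(params["sma_nu"] - 5.0)


grid = [{"sma_nu": value} for value in (4.0, 4.5, 5.0, 5.5, 6.0)]

# Each grid point is evaluated independently, so the work can be spread
# over several workers; map() returns the costs in grid order.
with ThreadPoolExecutor(max_workers=4) as pool:
    costs = list(pool.map(toy_cost, grid))

best_params = grid[costs.index(min(costs))]
```

Because no grid point needs the result of another one, the same list of costs is obtained whatever the number of workers.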

The major limitation of a grid search approach is the amount of time and memory required, which increases exponentially with the number of parameters explored/optimized. Other types of optimizers are under development for that reason. Stay tuned.
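
The exponential growth is easy to quantify: the number of simulations in a grid is the product of the number of values scanned for each parameter. A short sketch, where the choice of 10 values per parameter is arbitrary:

```python
import math


def grid_size(num_values_per_parameter):
    """Number of simulations in a grid: product of the values per parameter."""
    return math.prod(num_values_per_parameter)


# With 10 values per parameter, each added parameter multiplies the size by 10.
sizes = {k: grid_size([10] * k) for k in (1, 2, 3, 6)}
print(sizes)  # {1: 10, 2: 100, 3: 1000, 6: 1000000}
```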

To create a Parameter Optimizer, begin by selecting Parameter optimizer from the Tools menu:

A new window will open to configure the optimizer. The first task for users is to select the experimental data that simulations will be compared to:

Note

It is possible to optimize some parameters across more than one experiment for a single product. Select multiple experiments by holding down the Ctrl key while clicking on each experiment.

The rest of the optimizer configuration depends on the type of optimizer selected.

Grid-search optimizer¶

Users may choose the type of optimizer to run. The most general and default type is the “Grid-search Optimizer”, which builds a simulation grid for each target experiment, and computes the cost at each point in each grid. The cost of a parameter value is the sum of the costs across all target experiments. Therefore, the parameter value that is considered optimal is the one that provides the best fit across all target experiments.
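
The summing of costs across target experiments can be sketched as follows. The experiment names, parameter values and costs are invented, and the nested dictionary layout is just a convenient representation for the example.

```python
def total_costs(costs_per_experiment):
    """Sum, for each parameter value, the costs over all target experiments.

    costs_per_experiment: {experiment name: {parameter value: cost}}
    """
    totals = {}
    for costs in costs_per_experiment.values():
        for value, cost in costs.items():
            totals[value] = totals.get(value, 0.0) + cost
    return totals


# Made-up costs for one scanned parameter and two target experiments.
costs_per_experiment = {
    "Run_1": {4.0: 0.30, 5.0: 0.10, 6.0: 0.25},
    "Run_2": {4.0: 0.05, 5.0: 0.15, 6.0: 0.40},
}
totals = total_costs(costs_per_experiment)
best_value = min(totals, key=totals.get)
```

In this made-up case, 4.0 gives the best fit for Run_2 alone, but 5.0 has the lowest total cost and is therefore the value considered optimal across both experiments.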

This optimizer type is the default one:

Next, users must specify a starting simulation from the drop-down menu. The starting point simulation will specify the first and the last method steps to model. Optionally, users may also override the buffers used as initial condition for the column in simulation grids. Leave this blank to have that information read from the experiment’s method.

Finally, users must select the parameters to optimize by pressing the New parameter scan button. In the example below, sma_nu[2] and sma_ka[2] are selected. These scanned parameters are specified in the same way as in Creating a Simulation Grid: boundaries of the parameter space to explore, number of points and type of spacing must be specified:

Clicking Create will create the optimization and add it to the study. The newly created optimizer will automatically open in the central pane. It is also available in the Study Data browser, under the Analysis Tools > Optimizations section.

Self-refining binding model optimizer (Experimental)¶

The self-refining binding model optimizer (experimental) builds a multi-step optimizer to optimize the SMA binding model parameters ka, nu or sigma for all product components. The optimizer is still grid based, but avoids building a single high-dimensional grid that scans all parameters across all product components, which would be computationally expensive because of the number of dimensions, especially for products made of many components.

Such a large problem isn’t the smartest way to calibrate each product component anyway, since the correlations between binding model parameters across components are small.

The binding model optimizer makes use of that, and builds a first step that applies the same value of the scanned parameter (ka, nu and/or sigma) to all components. That optimizer step is called the constant step. From that step, the best value for each component is extracted, and a refining optimizer step for each component is built and run, centered around the best value found during the constant step, and scanning the small portion of the parameter space around it.
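
The two-stage strategy can be sketched for a single scanned parameter. This is a simplified illustration under several assumptions: each component has its own toy cost function, the refining grid is a fixed fraction of the initial range, and only one refining step is run. The function names, the `refine_fraction` argument and all values are hypothetical and do not describe the optimizer's actual settings.

```python
def linspace(low, high, num_points):
    """Evenly spaced values from low to high, end points included."""
    step = (high - low) / (num_points - 1)
    return [low + i * step for i in range(num_points)]


def self_refining_search(component_costs, low, high, num_points=5,
                         refine_fraction=0.2):
    """Coarse scan shared by all components, then one refined scan each."""
    constant_grid = linspace(low, high, num_points)      # constant step
    half_width = refine_fraction * (high - low) / 2
    results = {}
    for name, cost in component_costs.items():
        coarse_best = min(constant_grid, key=cost)       # best coarse value
        refined_grid = linspace(coarse_best - half_width,
                                coarse_best + half_width, num_points)
        results[name] = min(refined_grid, key=cost)      # refining step
    return results


# Toy per-component costs with made-up optima at 4.35 and 5.15.
component_costs = {
    "acidic_1": lambda value: abs(value - 4.35),
    "native": lambda value: abs(value - 5.15),
}
best = self_refining_search(component_costs, low=3.0, high=7.0)
```

Here each component costs 5 + 5 evaluations, whereas a joint grid with the same final resolution over both components would need many more points.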

Selecting the self-refining binding model optimizer produces a slightly different window for specifying parameters to be optimized. Under the “Constant Step Parameters” tab, users can add one or more of the SMA binding model parameters sma_ka, sma_nu, and sma_sigma. Reveal Chromatography will automatically propose boundaries designed to explore the full range of reasonable values for SMA parameters. Users can modify them by double-clicking on them:

Finally, in the Refining Steps Parameters tab, users can adjust the scanning strategy for the optimizer steps following the constant step, specifically the number of grid points, and the refining grid size:

Clicking Create will create the optimization and add it to the study. The newly created optimizer will automatically open in the central pane. It is also available in the Study Data browser, under the Analysis Tools > Optimizations section.

The central pane view¶

The newly created Optimizer’s central pane view displays the target experiment(s), the list of parameters that will be explored, and the total number of simulations that will be run.

Below, there will be a table with the output data that will be collected during the optimization, namely the simulation’s name, the values of scanned parameters, and total cost of resulting simulations, over all target experiments:

That table will be updated at the end of each optimizer step’s run.

To see the details about the parameters scanned, one can expand the optimization steps of the optimizer in the study data browser, and open the desired optimization step in the central pane. That will display information about the precise range of values that were scanned in that step:

Editing the cost function (OPTIONAL)¶

Though optional, editing the cost function can be useful to achieve the best automatic results from a Parameter Optimizer, and the best cost function can depend on the parameters optimized and the specific chromatograms the user is trying to calibrate against.

Reveal Chromatography currently offers only one type of cost function, named “Position/height/back slope”. This cost function is a combination of the distance between the experimental chromatogram and the simulated chromatogram along the following characteristics: peak time location, peak height, and the average of the slope of the back of the peak (between 80% of the peak and 20% of the peak). The cost in Reveal Chromatography is thus roughly computed as:

$\sum_{\rm all~components} \left( weight_{\rm position} \cdot \left| \frac{position_{\rm exp} - position_{\rm sim}}{position_{\rm exp}} \right| + weight_{\rm height} \cdot \left| \frac{height_{\rm exp} - height_{\rm sim}}{height_{\rm exp}} \right| + weight_{\rm slope} \cdot \left| \frac{slope_{\rm exp} - slope_{\rm sim}}{slope_{\rm exp}} \right| \right)$
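
The formula translates directly into a few lines of Python. This sketch only mirrors the rough expression above: the peak characteristics, the component name and the weight values are invented, and the application's actual implementation and default weights may differ.

```python
def relative_gap(experimental, simulated):
    """Absolute relative difference, as in each term of the formula."""
    return abs((experimental - simulated) / experimental)


def chromatogram_cost(exp_peaks, sim_peaks, weights):
    """Weighted sum of relative gaps over all components and characteristics."""
    total = 0.0
    for component, exp in exp_peaks.items():
        sim = sim_peaks[component]
        for feature in ("position", "height", "slope"):
            total += weights[feature] * relative_gap(exp[feature], sim[feature])
    return total


# Made-up peak characteristics and weights, for illustration only.
exp_peaks = {"native": {"position": 100.0, "height": 2.0, "slope": -0.5}}
sim_peaks = {"native": {"position": 110.0, "height": 1.8, "slope": -0.4}}
weights = {"position": 3.0, "height": 2.0, "slope": 1.0}

cost = chromatogram_cost(exp_peaks, sim_peaks, weights)
# 3 * 0.1 (position) + 2 * 0.1 (height) + 1 * 0.2 (slope), about 0.7
```

Because every term is a relative difference, the three characteristics are dimensionless and can be compared and weighted against each other.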

Once the optimizer has been created, users can customize the cost function’s weights for peak position, height, and back slope. That is done by clicking the View/edit cost function button:

This will open another new window. Click Edit weights/view weights and use the sliders in this window to set the weights at desired values:

By default, the peak time is set as the most important, followed by the peak height, and the peak slope is set as the least important, but these may need to be tweaked depending on the shape and properties of the target chromatogram.

Additionally, it may be useful to change these weights based on where the user is with the model calibration process. At the beginning, peaks may not be aligned at all, and the peak’s back slope isn’t an important parameter. It may become useful to increase its weight once parameters have been optimized to the point where the simulated peak position is close enough to the experimental one, and the user is now more focused on reproducing that back slope to improve the predicted stop collect time or the pool’s volume and other properties.

Running the optimizer¶

To run the optimizer, click on the Launch optimizer button at the bottom left of the central pane view. The Status will change to Running.... Progress will be indicated by the completion level indicator, next to the status:

Once the optimizer completes its run and cost function computation, the Status will become Finished running and the output table above will be populated with cost data, sorting the table by costs by default.

To view optimized parameters/simulations, use the study data browser to navigate to Analysis Tools > Optimizations > OPTIMIZER NAME > Optimal Solutions. There, users can select these optimal simulations, open them in the central pane for review, copy them or their components elsewhere in the study, or plot their chromatograms (see below) and more:

Users can also control the number of optimal solutions to expose in the Study Data browser by modifying the number in the box listed next to “Number of optimal solutions to save”:

It can be useful to increase that number to inspect or plot more simulations, for example to see if the optimizer’s ranking matches the user’s choices.

Note that the optimal simulations of an optimizer get stored in the project file. Consequently, it is not required to copy these simulations from that folder into the study to retain them, except maybe to provide a starting point for a future simulation grid. For the same reason, to reduce the size of the project file, it is recommended to reduce the number of optimal simulations to the minimum needed by the user.

Plotting the optimized solutions¶

Any of the optimal simulations can be plotted (one at a time), by right-clicking on the desired simulation in the Study Data browser and selecting Plot Simulation.

The plot will appear in the central pane, together with its target experiment, like this:

Visualizing the stability of optimal simulations¶

Another way to inspect optimized simulations is to visualize their costs compared to the rest of the parameter space. To do so, select View/Edit cost function in the optimizer view after the optimizer has run. The window will now display, below the cost function’s weights, a 1D plot of the cost of the simulations as a function of (one of) the parameter(s) being optimized:

If the effect of more than one parameter was explored, this 1D plot contains the cost as a function of one of the parameters, the other parameters being set at their lowest values. Sliders for the remaining parameters will appear below the plot, allowing users to see the cost for any value of the other parameters:

If working with more than one parameter, users can also visualize the performance of optimized simulations on a 2D heat map. To do so, select the 2D option and a heat map plot of the cost as a function of the first two parameters being optimized will appear:

Users can change what parameter is displayed along the x and the y axis, and explore the cost in the desired two dimensions:

By default, the heat map’s color bar doesn’t span the full range of cost values, since high costs correspond to a part of the parameter space that is typically not of interest. The color bar range is controlled by the “Colorbar high percentile” control below the color bar, set by default at 50%. That means that any cost above half of the max cost is represented by the same (high cost) color.

Depending on the situation, it may be useful to increase or decrease that value, to increase/reduce the dynamic range of the plot. For example, to distinguish between several seemingly identical low values (such as on the screenshot above), users may want to reduce the high percentile, so that the same color range corresponds to a smaller range in cost values and more details can be extracted around the lowest costs:

Large-scale computations with Reveal¶

Simulating chromatography processes is time-consuming and resource-intensive. In particular, the CADET solver at the core of Reveal computes for every simulation the state of each product component, at each time step in the solution, and at each column spatial step and bead spatial step. This is what allows Reveal to display component concentration states over time and over the column in the Particle Data animation tool.

In response, Reveal is designed to leverage every bit of resources available. That means that using a more powerful machine to run Reveal will translate into running explorations faster or being able to run larger ones. For real modeling work (model calibration, process optimization, process characterization), the minimum recommended hardware specifications to run Reveal are:

• 16+ GB of RAM
• 8+ CPU cores
• 200+ GB of free hard drive space

Both the amount of RAM and the number of CPUs will impact the speed of a given study/exploration. Too little RAM will lead to disk swapping, and have a dramatic impact on speed. More importantly, the size of available hard drive space will limit the size of the exploration that can be run, and too little space will lead to a failure of the exploration. It is therefore important to pre-compute the needed space before running an exploration.

If compute servers are available, the recommended configuration is:

• Single node, shared memory, large number of CPUs
• Linux Ubuntu/Redhat/CentOS
• SLURM job/resource scheduler if resource is shared

It is important to note that resource requirements (time, RAM and hard disk usages) increase quickly with:

• the size of the exploration,
• the number of product components,
• the time resolution of the simulations,
• the space resolution of the column and bead discretization.

A medium-scale study example¶

As a data point for a large-scale exploration for a process optimization job (using Reveal version 0.9), a 10,000-simulation grid ran in 2.5 days on a standard MacBook Pro laptop with:

• 16 GB of RAM,
• 8 CPU cores (Intel i7, 2.5 GHz),
• 2 TB of available hard drive space.

In this example, the analyzed product contained 5 components and the center-point simulation was set up with 15,000 time points per time step. Each simulation, once run, led to the storage of a 400+MB file. Building a 10,000-simulation grid with that setup would by default use 4 TB of disk space! By comparison, a simulation with the same number of components and the default 2000 points per time step leads to a 45-50 MB file size.
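
The disk space estimate above is simple arithmetic, sketched here in Python using decimal units (1 TB = 1,000,000 MB). The helper name is hypothetical.

```python
def grid_disk_usage_tb(num_simulations, file_size_mb):
    """Total disk usage in TB if every simulation file is kept."""
    return num_simulations * file_size_mb / 1_000_000


high_resolution = grid_disk_usage_tb(10_000, 400)     # 15,000 time points
default_resolution = grid_disk_usage_tb(10_000, 50)   # default 2000 points
print(high_resolution, default_resolution)  # 4.0 0.5
```

The same grid at the default time resolution would therefore need roughly 0.5 TB instead of 4 TB.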

However, the ability to delete CADET files along the way made it possible to reduce that usage to 2 TB. Ultimately, both the run time and the amount of hard drive space used were driven by the shortage of RAM: with enough RAM, dramatic speed-ups can be expected and disk usage is kept under a few GB.

A large-scale study example¶

The largest Reveal study conducted by our group required multiple fine-grained optimizations, each exploring ~200,000+ simulations. It was run on a Linux compute server with 40 CPUs and 128 GB of RAM, which uses the SLURM resource scheduler (requires Reveal version 0.10). That server made it possible to blindly calibrate all binding model parameters for 5 product components in ** hours (running a total of * simulations).

Recommendations¶

For large-scale explorations (above 1,000 simulations), it is recommended to:

• Create a small version of the full grid, and monitor the run time per simulation, RAM usage and CADET file sizes. Extrapolate to the needed grid to make sure enough RAM is available and compute the expected run time.
• Review the advanced simulation settings, in particular the solver’s time resolution and the space resolution of the column and bead discretization (see the Show Advanced Parameters button in the center point simulation view).
• Review the Solver settings in the Preferences, to make sure that the product of the number of workers and the number of CADET threads isn’t larger than the number of available CPUs. Otherwise, CPU over-allocation will trigger (sometimes serious) performance degradation. The symptom of this issue is that single CADET job times are higher during an exploration than when a single simulation is run.
• To avoid memory overflow/disk swap, request the simulation groups to store simulations on disk rather than in memory (see the check-box at the bottom of the panel to build new simulation groups).
• To avoid hard disk overflow, delete simulation files along the way, when doing process optimization/characterization, since the purpose of these explorations is to analyze performances. Again, that can be controlled when building new simulation groups (check-box in the bottom right of the builder panel). Note that, depending on the available memory and the number of CADET running CPUs (Edit > Preferences... > Solver > Executor num worker), CADET files may still occupy a significant amount of disk space while waiting to be post-processed, and therefore the exploration may still require a significant amount of disk space.
• Disk usage, memory usage, disk swapping and logs should be monitored closely while running the exploration to anticipate issues and potentially free up more disk space if needed.
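
Two of the checks above lend themselves to a quick calculation before launching a large run. The sketch below assumes that run time and disk usage scale linearly with the number of simulations, which is only a first approximation (RAM shortage and disk swapping can make things worse). The function names and the pilot numbers are hypothetical.

```python
import os


def extrapolate_run(pilot_size, pilot_seconds, file_size_mb, full_size):
    """Extrapolate a small pilot grid to the full grid (linear scaling)."""
    scale = full_size / pilot_size
    return {
        "hours": pilot_seconds * scale / 3600,
        "disk_gb": file_size_mb * full_size / 1000,  # if all files are kept
    }


def solver_settings_ok(num_workers, threads_per_worker, available_cpus=None):
    """True if workers x CADET threads does not exceed the available CPUs."""
    cpus = available_cpus or os.cpu_count() or 1
    return num_workers * threads_per_worker <= cpus


# Made-up pilot: 100 simulations in 36 minutes, 400 MB per simulation file.
estimate = extrapolate_run(pilot_size=100, pilot_seconds=2160,
                           file_size_mb=400, full_size=10_000)
print(estimate)  # {'hours': 60.0, 'disk_gb': 4000.0}
print(solver_settings_ok(num_workers=4, threads_per_worker=2, available_cpus=8))
```

If the extrapolated disk usage exceeds the free space, either the resolution settings or the option to delete simulation files along the way should be revisited before starting the full exploration.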

Feel free to Contact us for support to run large-scale Reveal explorations.