This example demonstrates how to find an appropriate temperature range for parallel tempering using the automated analysis workflow from the extensions module. The core subset selection problem from example 1C is used as a case study. Several Metropolis searches with different fixed temperatures are applied. The results are then inspected using the provided R package, to infer an appropriate temperature range to be used for the parallel tempering algorithm.
To find an appropriate temperature range for the parallel tempering algorithm, we will run several Metropolis searches with different temperatures and assess their performance for a series of datasets. Each dataset consists of a distance matrix based on which the entry-to-nearest-entry score of any considered subset is computed (see example 1C). This score is to be maximized. To execute the algorithms and gather statistics about the obtained solution quality, convergence, etc. we will use the analysis workflow included in the extensions module. This requires to add this module as a dependency to your project, in addition to the core module.
The first step is to create an analysis object, specifying the solution type of the problems that will be analyzed.
// initialize analysis object Analysis<SubsetSolution> analysis = new Analysis<>();
Then, the problems are added. In this example there is one problem for each
considered dataset and they all have the same entry-to-nearest-entry objective. Data is read
from CSV files using the CoreSubsetFileReader
which is included in the full
source code
of example 1A where the corresponding data type
CoreSubsetData
is also defined. The implementation of the
EntryToNearestEntryObjective
is described in example 1C.
The size of the constructed core is determined relative to the size of the full dataset by specifying a fixed selection ratio (e.g. 0.2). Every analyzed problem needs to be assigned a unique ID (string). Here, the file name is used as the ID (without directories and without extension).
// set input file paths List<String> filePaths = Arrays.asList("my/input/file1.csv", "my/input/file2.csv", ...); // read datasets CoreSubsetFileReader reader = new CoreSubsetFileReader(); List<CoreSubsetData> datasets = new ArrayList<>(); for(String filePath : filePaths){ datasets.add(reader.read(filePath)); } // create objective EntryToNearestEntryObjective obj = new EntryToNearestEntryObjective(); // set selection ratio double selRatio = 0.2; // add problems (one per dataset) for(int d = 0; d < datasets.size(); d++){ // create problem CoreSubsetData data = datasets.get(d); int coreSize = (int) Math.round(selRatio * data.getIDs().size()); SubsetProblem<CoreSubsetData> problem = new SubsetProblem<>(data, obj, coreSize); // set problem ID to file name (without directories and without extension) String path = filePaths.get(d); String filename = new File(path).getName(); String id = filename.substring(0, filename.lastIndexOf(".")); // add problem to analysis analysis.addProblem(id, problem); }
Ten Metropolis searches are then added to the analysis, each with a unique ID and different fixed temperature,
equally spread across a given interval [minTemp, maxTemp]
. Note that the method
addSearch(id, searchFactory)
requires to specify a search factory.
This is because each algorithm will be executed several times and for each such run a new search
instance is created, using the given factory. The search factory interface is a functional interface
with a single method create(problem)
to create a search that solves the given problem.
A search factory can therefore easily be specified with a Java 8 lambda expression, as in the example
code below. The same stop criterion is used here for every generated search (maximum runtime of 5 minutes).
// create stop criterion long timeLimit = 300; StopCriterion stopCrit = new MaxRuntime(timeLimit, TimeUnit.SECONDS); // set temperature range double minTemp = ... double maxTemp = ... // set number of Metropolis searches int numSearches = 10; // create and add searches double tempDelta = (maxTemp - minTemp)/(numSearches - 1); for(int s = 0; s < numSearches; s++){ // set temperature double temp = minTemp + s * tempDelta; // add Metropolis search String id = "MS-" + (s+1); analysis.addSearch(id, problem -> { Search<SubsetSolution> ms = new MetropolisSearch<>(problem, new SingleSwapNeighbourhood(), temp); ms.addStopCriterion(stopCrit); return ms; }); }
By default, every search is executed 10 times (in addition to a single burn-in run).
The number of runs can be changed using the method setNumRuns(n)
.
// run searches 5 times analysis.setNumRuns(5);
If desired, the number of burn-in runs can also be customized with setNumBurnIn(n)
.
Results of these preliminary runs are discarded. Both settings can also be adjusted at a
search specific level using setNumRuns(searchID, n)
and
setNumBurnIn(searchID, n)
.
Now that the analysis has been set up, we are ready to run it.
AnalysisResults<SubsetSolution> results = analysis.run();
"analysis"
so that they can be filtered from other messages when using a marker aware logging
framework (such as logback). To pick up log messages, follow
these instructions.
The results of the analysis are packed into an object of type AnalysisResults
.
For every combination of an applied search and analyzed problem, the progress of each performed
search run and corresponding final best found solution are reported. The results of
a specific run can be retrieved with
SearchRunResults<SubsetSolution> run = results.getRun("problemID", "searchID", i);
using the IDs that have been assigned to the problem and search during setup, where i
indicates the search run index (a value in [0, n-1] when n search runs have
been performed). From the returned object, the times at which a new best solution
was found and the corresponding evaluation values can be retrieved, as well as the final best solution
obtained in this search run.
Although the results can be directly accessed as described above, the easiest way to inspect them is through the provided R package. To do this, first write the results to a JSON file.
results.writeJSON("sweep.json", JsonConverter.SUBSET_SOLUTION);
The second argument specifies how to convert solutions to JSON format. For subset selection problems,
the predefined JsonConverter.SUBSET_SOLUTION
can be used, defined as
sol -> Json.array(sol.getSelectedIDs().toArray())
Given a subset solution, the predefined converter creates a JSON array containing the IDs of the selected items. To write your own JSON converters, take a look at the mjson library. If no converter is given, the output file will not contain the actual best found solutions but only report the progress in terms of obtained evaluation value over time.
The obtained JSON file can be loaded into R using the provided R package. The package can be installed from CRAN with
> install.packages("james.analysis")
and loaded with
> library(james.analysis)
Running the analysis with a minimum and maximum temperature of 10‑8 and 10‑3, respectively, on
4 datasets
(one coconut, two maize and one pea collection)
produced the file
sweep-0.001.json.
It can be loaded into R with the function readJAMES
.
> sweep <- readJAMES("sweep-0.001.json")
Printing a summary with
> summary(sweep)
reveals the 4 analyzed datasets and the 10 applied Metropolis searches. For each combination of an analyzed problem and applied search, the summary reports the number of executed search runs and the mean and median value obtained in those runs, together with the corresponding standard deviation and interquartile range.
Problem: Search: Runs: Mean value: St. dev: Median: IQR: --------------- ------- ----- ----------- -------- -------- -------- coconut MS-1 5 0.57 0.000474 0.57 0.000521 coconut MS-2 5 0.564 0.00019 0.564 0.00034 coconut MS-3 5 0.545 0.000813 0.545 0.000706 coconut MS-4 5 0.53 0.000772 0.53 0.000979 coconut MS-5 5 0.522 0.000752 0.522 0.00117 coconut MS-6 5 0.515 0.000861 0.515 0.00109 coconut MS-7 5 0.512 0.000811 0.512 0.000549 coconut MS-8 5 0.509 0.000498 0.509 0.000855 coconut MS-9 5 0.506 0.000657 0.505 0.000862 coconut MS-10 5 0.504 0.00131 0.505 0.00148 maize-accession MS-1 5 0.576 0.000397 0.576 0.000271 maize-accession MS-2 5 0.578 0.0001 0.578 8.32e-05 maize-accession MS-3 5 0.571 0.000268 0.571 0.000137 maize-accession MS-4 5 0.564 0.0012 0.564 0.00139 maize-accession MS-5 5 0.556 0.00113 0.556 0.00059 maize-accession MS-6 5 0.55 0.000902 0.55 0.00131 maize-accession MS-7 5 0.545 0.000872 0.545 0.000276 maize-accession MS-8 5 0.542 0.000581 0.542 0.000705 maize-accession MS-9 5 0.537 0.000535 0.537 0.000958 maize-accession MS-10 5 0.535 0.00135 0.534 0.00226 maize-bulk MS-1 5 0.43 0.000101 0.43 4.02e-05 maize-bulk MS-2 5 0.43 8.94e-16 0.43 9.99e-16 maize-bulk MS-3 5 0.43 8.36e-16 0.43 9.99e-16 maize-bulk MS-4 5 0.43 0.000104 0.43 7.05e-15 maize-bulk MS-5 5 0.429 0.000276 0.429 0.000265 maize-bulk MS-6 5 0.428 0.000721 0.428 0.000548 maize-bulk MS-7 5 0.426 0.000617 0.426 0.000857 maize-bulk MS-8 5 0.422 0.000492 0.422 0.000383 maize-bulk MS-9 5 0.418 0.000931 0.418 0.00143 maize-bulk MS-10 5 0.415 0.00188 0.414 0.00204 pea-small MS-1 5 0.353 0.000761 0.353 0.00135 pea-small MS-2 5 0.346 0.000301 0.346 0.000285 pea-small MS-3 5 0.322 0.00106 0.322 0.00123 pea-small MS-4 5 0.303 0.00112 0.302 0.000566 pea-small MS-5 5 0.291 0.00107 0.291 0.00189 pea-small MS-6 5 0.281 0.0016 0.281 0.000265 pea-small MS-7 5 0.274 0.00062 0.274 0.000846 pea-small MS-8 5 0.268 0.00113 0.268 0.00129 pea-small MS-9 5 0.263 0.000811 0.263 0.000542 pea-small MS-10 5 0.26 0.00182 0.26 0.00128
To get a better overview of how the temperature influences the solution quality, we can draw box plots with the function boxplot
.
> temp.range <- seq(1e-8, 1e-3, length.out = 10) > x <- "Temperature" > boxplot(sweep, problem = "coconut", names = temp.range, xlab = x) > boxplot(sweep, problem = "maize-accession", names = temp.range, xlab = x) > boxplot(sweep, problem = "maize-bulk", names = temp.range, xlab = x) > boxplot(sweep, problem = "pea-small", names = temp.range, xlab = x)
This produces the following visualizations:
For three of the analyzed datasets, the solution quality consistently drops when increasing the temperature. Only for one of both maize collections (maize-accession) there is a slight increase in performance at a temperature of about 10‑4, followed by a large decrease. These results suggest that we might have set the maximum temperature too high. So let's try again with a lower maximum temperature of 10‑4 (and the same minimum of 10‑8). This produces the output file sweep-0.0001.json.
> sweep2 <- readJAMES("sweep-0.0001.json") > temp.range <- seq(1e-8, 1e-4, length.out = 10) > x <- "Temperature" > boxplot(sweep2, problem = "coconut", names = temp.range, xlab = x) > boxplot(sweep2, problem = "maize-accession", names = temp.range, xlab = x) > boxplot(sweep2, problem = "maize-bulk", names = temp.range, xlab = x) > boxplot(sweep2, problem = "pea-small", names = temp.range, xlab = x)
The box plots now look like this:
When increasing the temperature there is now clearly an initial rise in solution quality, and a reduction of variability across runs, for each analyzed dataset. For all datasets except one maize collection (maize-bulk) this rise is followed by a moderate performance drop, which is not necessarily undesired. In a parallel tempering approach, these warmer replicas increase the power to escape from local optima. Yet the search strategy cannot solely be based on lucky shots so that this range of [10‑8, 10‑4] is a better choice than the first attempt with a higher maximum temperature of 10‑3.
Example 5B actually applies parallel tempering with the chosen temperature range of [10‑8, 10‑4] and compares its performance with that of a basic random descent algorithm.
The complete source code of this example is available on GitHub. You can also download a ZIP file that contains the Java sources of all examples, a compiled JAR (including all dependencies) as well as some input files for in case you want to run any of the examples. To run this example, execute
$ java -cp james-examples.jar org.jamesframework.examples.analysis.ParameterSweep <selectionratio> <runs> <runtime> <mintemp> <maxtemp> <numsearches> [ <inputfile> ]+from the command line. The selection ratio determines the core size. The following arguments specify the number of runs (repeats) that will be performed for each search, the runtime limit (in seconds) of each run, the minimum and maximum temperature and the actual number of Metropolis searches (i.e. number of considered temperatures). The input files (at least one) should be CSV files in which the first row contains N item names and the subsequent N rows specify an N × N distance matrix. The analysis will be repeated for each dataset loaded from the given files. All output is combined into a single file
sweep.json
that
can be loaded in R using the provided R package.