Symbolic regression
The term “symbolic regression” (SR) denotes a process in which measured data are fitted by a suitable mathematical formula such as sin(x)+1/x. This process is well known among mathematicians and is used when data from an unknown process are obtained. For a long time SR was the domain of humans only, but for the last few decades it has also been a domain of computers. One idea is to solve various SR problems by means of evolutionary algorithms (EAs).
 Snippet from Wikipedia: Symbolic regression
Symbolic regression is a type of regression analysis that searches the space of mathematical expressions to find the model that best fits a given dataset, both in terms of accuracy and simplicity. No particular model is provided as a starting point to the algorithm. Instead, initial expressions are formed by randomly combining mathematical building blocks such as mathematical operators, analytic functions, constants, and state variables.
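As a rough illustration of the idea described above, the following plain-Python sketch builds random expressions from a few building blocks and keeps the one with the smallest total absolute error. It is not the JGAP or jhpro implementation: the names random_expr, evaluate, total_error and random_search are hypothetical, the data are synthetic, and pure random search stands in for the evolutionary steps (crossover, mutation) that a real run would use.

```python
import random

# Building blocks (an illustrative choice, not the set used by any library).
BINARY_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def random_expr(rng, depth):
    """Build a random expression tree as nested tuples (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        # Leaf: either the state variable or a small integer constant.
        return "x" if rng.random() < 0.7 else rng.randint(-2, 2)
    op = rng.choice(sorted(BINARY_OPS))
    return (op, random_expr(rng, depth - 1), random_expr(rng, depth - 1))

def evaluate(expr, x):
    """Evaluate an expression tree at the point x."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return BINARY_OPS[op](evaluate(left, x), evaluate(right, x))
    if expr == "x":
        return x
    return expr  # numeric constant

def total_error(expr, data):
    """Sum of absolute errors over all (x, y) pairs; lower is better."""
    return sum(abs(evaluate(expr, x) - y) for x, y in data)

def random_search(data, tries=2000, max_depth=3, seed=0):
    """Keep the best of `tries` random expressions (no evolution step)."""
    rng = random.Random(seed)
    best = random_expr(rng, max_depth)
    best_err = total_error(best, data)
    for _ in range(tries - 1):
        cand = random_expr(rng, max_depth)
        err = total_error(cand, data)
        if err < best_err:
            best, best_err = cand, err
    return best, best_err

# Synthetic data generated from y = x*x + x, used only for illustration.
DATA = [(x, x * x + x) for x in (-2.0, -1.0, 0.0, 1.0, 2.0, 3.0)]
best, err = random_search(DATA)
print(best)
print(err)
```

An evolutionary algorithm replaces the independent random draws with a population that is recombined and mutated from one generation to the next, which is the role JGAP plays in the examples that follow.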
Symbolic Regression example
We will try to find the best analytic solution for XY data given in numeric form. First, let us create a configuration file “examplefile.conf” in the form shown below. This will be our input. It defines the problem and the input data (using one variable).
 examplefile.conf
#
# Polynom x^4 + x^3 + x^2 - x
# The JGAP example
#
presentation: P(4) x^4 + x^3 + x^2 - x (the JGAP example)
num_input_variables: 1
variable_names: x y
functions: Add,Subtract,Multiply,Divide,Pow,Log,Sine
terminal_range: -10 10
max_init_depth: 4
population_size: 1000
max_crossover_depth: 8
num_evolutions: 800
max_nodes: 20
stop_criteria_fitness: 0.1
data
-2.378099 26.567495
4.153756 382.45743
2.6789956 75.23481
5.336802 986.33777
2.4132318 51.379707
-1.7993588 9.693933
3.9202332 307.8775
2.9227705 103.56364
-0.1422224 0.159982
4.9111285 719.39545
1.2542424 4.76668
1.5987749 11.577456
4.7125554 615.356
-1.1101999 2.493538
-1.7379236 8.631802
3.8303614 282.29697
5.158349 866.7222
3.6650343 239.42934
0.3196721 -0.17437163
-2.3650131 26.014963
Then run this code using the class “SymRegression”. It tries to find the best possible solution to describe the data.
from jhplot import *
from jhpro.sregression import *

js=SymRegression("examplefile.conf")  # the configuration file created above
js.run()  # run this example
print "Best solution=",js.getBestSolution()  # print best analytic solution
The output of this program is the function:
(((x * x) + (((x * x) * x) + x)) * x) - x
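Expanding the nested output by hand gives x^4 + x^3 + x^2 - x, assuming the final operator is a subtraction as in the JGAP example polynomial. A quick numerical check in plain Python (independent of jhpro; the function names found and target are hypothetical) compares the two forms on a grid of points:

```python
def found(x):
    # The expression reported by the run, transcribed literally.
    return (((x * x) + (((x * x) * x) + x)) * x) - x

def target(x):
    # The polynomial named in the configuration file comments.
    return x**4 + x**3 + x**2 - x

# Grid from -5.0 to 5.0 in steps of 0.25.
points = [i / 4.0 for i in range(-20, 21)]
max_diff = max(abs(found(x) - target(x)) for x in points)
print(max_diff)
```

A maximum difference at the level of floating-point rounding indicates that the two expressions describe the same function.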
Other example files can be found in /scavis/examples/data/jhpro/sregression/.
Description of configuration files
This section describes the configuration parameters used with the JGAP program. They were implemented by Hakan Kjellerstrand. A configuration file consists of the following parameters.

#, %
: Line comments; lines that start with the characters "#" or "%" are ignored.
presentation
: A text which is shown first in the run. 
num_input_variables
: Number of input variables in the data set. 
output_variable
: The index (0-based) of the output variable. Default is the last variable.
variable_names
: The names of the variables, in order. Default is "V0", "V1", etc.
data
: Starts the data section, where each row is presented per line. The attributes may be separated by "," or some space. The decimal point is a "." (dot). If a data row contains a "?" (question mark) in the position of the output variable, then it is considered a "user defined test" and the fittest program will be tested against this data last in the run.
terminal_range
: The range for the Terminal as lower upper. Note: Only one Terminal is used.
terminal_wholenumbers
: Whether the Terminal should use whole numbers or not (boolean).
constant
: Define a Constant with this value.
functions
: Define the functions, with the same name as in JGAP (or own defined functions). 
adf_arity
: If > 0 then ADF (automatically defined functions) is used. This is somewhat experimental as I am still trying to understand how ADFs work.
adf_function
: The functions used for ADF. 
adf_type
: Either double or boolean. If set to boolean, we can use the boolean and logical operators. 
max_init_depth
: JGAP parameter maxInitDepth
min_init_depth
: JGAP parameter minInitDepth
program_creation_max_tries
: JGAP parameter programCreationMaxTries
population_size
: JGAP parameter populationSize
max_crossover_depth
: JGAP parameter maxCrossoverDepth
function_prob
: JGAP parameter functionProb
reproduction_prob
: JGAP parameter reproductionProb
mutation_prob
: JGAP parameter mutationProb
crossover_prob
: JGAP parameter crossoverProb
dynamize_arity_prob
: JGAP parameter dynamizeArityProb
no_command_gene_cloning
: JGAP parameter no_command_gene_cloning
use_program_cache
: JGAP parameter use_program_cache
new_chroms_percent
: JGAP parameter newChromsPercent
num_evolutions
: JGAP parameter numEvolution
tournament_selector_size
: JGAP parameter tournamentSelectorSize
max_nodes
: JGAP parameter maxNodes
scale_error
: Sometimes the data values are very small which gives small fitness values (i.e. errors), making it hard to get any progress. Setting this parameter will multiply the errors by this value. 
stop_criteria_fitness
: If set (>= 0) then the program will run "forever" (instead of num_evolutions) until the fitness is less than or equal to the value.
show_population
: This shows the whole population in each generation. Mainly for debugging purposes. 
show_similiar
: Shows all the solutions (programs) with the same fitness value as the best solution. Alternative name: show_similar.
similiar_sort_method
: Method of sorting the similar solutions when using show_similiar. Alternative name: similar_sort_method. Valid options:
  occurrence
  : descending number of occurrences (default)
  length
  : ascending length of solutions
show_progression
: boolean. If true then the generation number is shown for all generations when nothing is happening (i.e. no gain in fitness). 
sample_pct
: (float) Takes a (sample) percentage of the data set if > 0.0. 
validation_pct
: Withholds a percentage of the test cases as a validation set. The fitness on this validation set is shown.
show_all_generations
: Show info of all generations, not just when fitness is changed. 
hits_criteria
: Criteria of a hit: if the difference is <= this value, it is considered a hit. The number of non-hits is then used as a fitness measure instead of the sum of errors. Setting this parameter also shows the number of programs which are <= this value.
mod_replace
: Sets the replacement value of 0 (zero) for the ModuloIntD function (see above).
showResults
: boolean. If set then all the fitness cases are shown with the output of the fitted program, together with the difference to the correct values.
resultPrecision
: The precision of the output used in showResults, default 5.
error_method
: Error method to use. Valid options are:
  totalError
  : sum of (absolute) errors (default)
  minError
  : minimum error
  meanError
  : mean error
  medianError
  : median error
  maxError
  : max error
no_terminals
: If true then no Terminal is used, i.e. no numbers, just variables. Default false. 
make_time_series
: Make a time series of the first line of data. The value of num_input_variables determines the number of laps (+1 for the output variable).
make_time_series_with_index
: As make_time_series, with an extra input variable for the index of the series. (Somewhat experimental.)
minNodes: value penalty
: Minimum number of nodes (terminals + functions). If the number of nodes in a program is less than value then a penalty of penalty is added.
alldifferent_variables: true/false penalty
: All the variables (terminals) should be different. If there is more than one occurrence of a variable in a program then a penalty of penalty is added (for each extra variable).
ignore_variables
: (TBW) It would be nice to be able to ignore some variables in the data set. But this is yet to be written. 
return_type
: (TBW) This should be the type of the "main" return value. Note: it is now hard coded in the program as double/DoubleClass.
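To make the file layout described above concrete, here is a minimal plain-Python sketch of a reader for this format. It is not the parser used by JGAP or jhpro: parse_config is a hypothetical name, and the sketch makes simplifying assumptions (every line before the data keyword is a key: value pair, and any data row containing "?" is treated as a user defined test, regardless of the column it appears in).

```python
def parse_config(text):
    """Split a configuration text into (params, rows, user_tests)."""
    params, rows, user_tests = {}, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#%":
            continue  # blank line or line comment
        if not in_data:
            if line == "data":
                in_data = True  # every later line is a data row
                continue
            key, _, value = line.partition(":")
            params[key.strip()] = value.strip()
            continue
        # Attributes may be separated by "," or by whitespace.
        fields = line.replace(",", " ").split()
        if "?" in fields:
            user_tests.append(fields)  # kept as text; output is unknown
        else:
            rows.append([float(f) for f in fields])
    return params, rows, user_tests

# A made-up configuration, used only to exercise the sketch.
SAMPLE = """\
# a line comment
% another line comment
num_input_variables: 1
variable_names: x y
terminal_range: -10 10
data
1.0 1.0
2.0, 4.0
3.0 ?
"""

params, rows, user_tests = parse_config(SAMPLE)
print(params["variable_names"].split())
print(rows)
print(user_tests)
```

Values are kept as strings in params so that multi-part settings such as terminal_range (lower upper) can be split by the caller as needed.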
Supported functions
The program supports many functions using the JGAP library. The “main” type is double, so not all functions are applicable.
Configuration file examples
Here we link about 40 examples of configuration files for symbolic regression:
A complete example
Below is a Python example which 1) finds an analytic function, 2) constructs a generic tree as an image and shows it, and 3) plots the best-fitting function. We will use the same “examplefile.conf” file as before. Note that you can read this file either from the file system or from a URL location. We prefer the latter:
Here is the result of this example. We plot the original data and the found symbolic function:
Initializing calculation dynamically
Instead of reading a configuration file, one can initialize the calculation dynamically inside a program or a Python macro. Here is a small example where the configuration settings and data are supplied within Python (or Java) code:
Symbolic regression in many dimensions
Simply create a new config file “test2.conf” and run it as before:
# Gelman's example of linear regression
#
# y = sqrt(x1^2+x2^2)
presentation: Gelman's linear regression problem, solution sqrt(x^2+y^2)
return_type: DoubleClass
num_input_variables: 2
variable_names: z x y
output_variable: 0
num_rows: 40
functions: Multiply,Divide,Add,Subtract,Sqrt,Pow
terminal_range: -4 4
terminal_wholenumbers: true
max_init_depth: 4
population_size: 1000
max_crossover_depth: 8
num_evolutions: 1000
max_nodes: 21
# hits_criteria: 0.1
error_method: meanError
adf_arity: 0
adf_type: double
data
15.68 6.87 14.09
6.18 4.4 4.35
18.1 0.43 18.09
9.07 2.73 8.65
17.97 3.25 17.68
10.04 5.3 8.53
20.74 7.08 19.5
9.76 9.73 0.72
8.23 4.51 6.88
6.52 6.4 1.26
15.69 5.72 14.62
15.51 6.28 14.18
20.61 6.14 19.68
19.58 8.26 17.75
9.72 9.41 2.44
16.36 2.88 16.1
18.3 5.74 17.37
13.26 0.45 13.25
12.1 3.74 11.51
18.15 5.03 17.44
16.8 9.67 13.74
16.55 3.62 16.15
18.79 2.54 18.62
15.68 9.15 12.74
4.08 0.69 4.02
15.45 7.97 13.24
13.44 2.49 13.21
20.86 9.81 18.41
16.05 7.56 14.16
6 0.98 5.92
3.29 0.65 3.22
9.41 9 2.74
10.76 7.83 7.39
5.98 0.26 5.97
19.23 3.64 18.89
15.67 9.28 12.63
7.04 5.66 4.18
21.63 9.71 19.32
17.84 9.36 15.19
7.49 0.88 7.43
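The configuration above selects error_method: meanError and names sqrt(x^2+y^2) as the expected solution. The following plain-Python sketch evaluates that candidate on the first four data rows and computes the error measures listed for error_method, plus a non-hit count in the spirit of hits_criteria. It is only an assumption about how such measures could be computed (absolute error per row); the helper names candidate, error_summary and non_hits are hypothetical and not part of JGAP or jhpro.

```python
import math

# First four data rows from the configuration above, in the order z x y.
ROWS = [
    (15.68, 6.87, 14.09),
    (6.18, 4.4, 4.35),
    (18.1, 0.43, 18.09),
    (9.07, 2.73, 8.65),
]

def candidate(x, y):
    # The solution named in the "presentation" line.
    return math.sqrt(x * x + y * y)

def median(values):
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2.0

def error_summary(rows, model):
    """Absolute error per row, summarised in five ways."""
    errors = [abs(model(x, y) - z) for z, x, y in rows]
    return {
        "totalError": sum(errors),
        "minError": min(errors),
        "meanError": sum(errors) / len(errors),
        "medianError": median(errors),
        "maxError": max(errors),
    }

def non_hits(rows, model, hits_criteria):
    """Number of rows whose absolute error exceeds hits_criteria."""
    return sum(1 for z, x, y in rows if abs(model(x, y) - z) > hits_criteria)

summary = error_summary(ROWS, candidate)
for name in sorted(summary):
    print(name, summary[name])
print(non_hits(ROWS, candidate, 0.1))
```

Because the data are rounded to two decimals, the errors of the exact formula are small but not zero, which is one reason a tolerance such as hits_criteria or stop_criteria_fitness can be more practical than requiring a perfect fit.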