DMelt:AI/Symbolic Regression


Symbolic regression

Term "symbolic regression" represents a process during which are measured data fitted by suitable mathematical formula like sin(x)+1/x, etc. This process is amongst mathematician quite well known and used when some data of unknown process are obtained. For long time SR was domain only of humans but for a few last decades it is also domain of computers. Idea how to solve various problems by SR by means of evolutionary algorithms (EAs). Read Symbolic regression article.

Symbolic regression using Jython and Java

Let us perform a symbolic regression using input arrays of X and Y values. Our goal is to find the simplest function that fits the data using the operations -, +, *.

Our target function is [math]\displaystyle{ f(x) = x^2 + 10x }[/math].

# Example: find a target function f(x) = x^2 + 10*x using input data

from com.lagodiuk.gp.symbolic import *
from com.lagodiuk.gp.symbolic.interpreter import *

# build x and y data: y[i] = x[i]**2 + 10*x[i]
x=[0,1,2,3,4,5,6]
y=[0,11,24,39,56,75,96]
data=[]

for i in range( len(x) ):
    data.append(Target().when("x", x[i]).targetIs(y[i]))

# build fitness function
fitness=TabulatedFunctionFitness(data)
engine =SymbolicRegressionEngine(fitness,["x"],[Functions.ADD, Functions.SUB, Functions.MUL, Functions.VARIABLE, Functions.CONSTANT])
engine.evolve(200)

# get answer
bestSyntaxTree = engine.getBestSyntaxTree()
currFitValue = engine.fitness(bestSyntaxTree)
print "Iterations=",engine.getIteration()
print "Fit value=",currFitValue
print "Function=",bestSyntaxTree.print()

The output shows the number of iterations, the final fitness value, and the reconstructed function.

[Figure: symbolic regression solver based on genetic programming (DMelt example)]

To construct a function, the following elements are available:

  • Functions.CONSTANT
  • Functions.VARIABLE
  • Functions.ADD
  • Functions.SUB
  • Functions.MUL
  • Functions.DIV
  • Functions.SQRT
  • Functions.POW
  • Functions.LN
  • Functions.SIN
  • Functions.COS
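
Any of these elements can be passed to the engine constructor. Below is a minimal sketch, assuming the same lagodiuk API as in the script above, that widens the basis set with division and sine:

from com.lagodiuk.gp.symbolic import *
from com.lagodiuk.gp.symbolic.interpreter import *

# tabulate the same target f(x) = x^2 + 10*x as above
data=[]
for xi in range(7):
    data.append(Target().when("x", xi).targetIs(xi*xi+10*xi))

fitness=TabulatedFunctionFitness(data)
# richer basis set: division and sine in addition to +, -, *
basis=[Functions.ADD, Functions.SUB, Functions.MUL, Functions.DIV,
       Functions.SIN, Functions.VARIABLE, Functions.CONSTANT]
engine=SymbolicRegressionEngine(fitness,["x"],basis)
engine.evolve(200)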

See the Java API of the lagodiuk classes for symbolic regression.

Symbolic Regression example

This is a second approach, using a different API based on an input configuration file. It provides more choices for performing symbolic regression in Java and the corresponding scripting languages (Jython, JRuby, Groovy).

We will try to find the best analytic description of X-Y data given in numeric form. First, let us create a configuration file "example.conf" of the following form. This will be our input: it defines the problem and the input data (using one variable).

#
# Polynomial x^4 + x^3 + x^2 - x
# The JGAP example
#
presentation: P(4) x^4 + x^3 + x^2 - x (the JGAP example)
num_input_variables: 1
variable_names: x y
functions: Add,Subtract,Multiply,Divide,Pow,Log,Sine
terminal_range: -10 10
max_init_depth: 4
population_size: 1000
max_crossover_depth: 8
num_evolutions: 800
max_nodes: 20
stop_criteria_fitness: 0.1
data
-2.378099   26.567495
4.153756   382.45743
2.6789956   75.23481
5.336802   986.33777
2.4132318   51.379707
-1.7993588   9.693933
3.9202332   307.8775
2.9227705   103.56364
-0.1422224   0.159982
4.9111285   719.39545
1.2542424   4.76668
1.5987749   11.577456
4.7125554   615.356
-1.1101999   2.493538
-1.7379236   8.631802
3.8303614   282.29697
5.158349   866.7222
3.6650343   239.42934
0.3196721   -0.17437163
-2.3650131   26.014963

Then run the following code, which uses the class "SymRegression" to find the best possible solution describing the data.

from jhplot  import *
from jhpro.sregression import *

js=SymRegression("example.conf")
js.run() # run this example
print "Best solution=",js.getBestSolution()  # print best analytic solution

The output of this program is the function:

(((x * x) + (((x * x) * x) + x)) * x) - x
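
As a quick sanity check (plain Python, independent of the DMelt API), the evolved expression can be compared with the target polynomial at a few points; algebraically it expands to exactly x^4 + x^3 + x^2 - x:

# verify: (((x*x) + (((x*x)*x) + x)) * x) - x  ==  x^4 + x^3 + x^2 - x
for x in [-2.0, -0.5, 1.0, 3.0]:
    evolved=(((x*x) + (((x*x)*x) + x)) * x) - x
    target =x**4 + x**3 + x**2 - x
    print x, evolved, target   # the two values agree at every point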

Description of configuration files

This section describes the configuration parameters of the JGAP-based program, implemented by Hakan Kjellerstrand. A configuration file consists of the following parameters.


  • #, %: Line comments; lines that start with "#" or "%" are ignored.
  • presentation: A text which is shown first in the run.
  • num_input_variables: Number of input variables in the data set.
  • output_variable: The index (0-based) of the output variable. Default is the last variable.
  • variable_names: The names of the variables, in order. Default is "V0", "V1", etc.
  • data: Starts the data section, with one data row per line. The attributes may be separated by "," or whitespace. The decimal point is a "." (dot). If a data row contains a "?" (question mark) in the position of the output variable, it is considered a "user defined test", and the fittest program is tested against this row last in the run.
  • terminal_range: The range for the Terminal, given as "lower upper". Note: only one Terminal is used.
  • terminal_wholenumbers: Whether the Terminal should use whole numbers (boolean).
  • constant: Defines a Constant with this value.
  • functions: Defines the functions, with the same names as in JGAP (or user-defined functions).
  • adf_arity: If > 0, ADF (automatically defined functions) is used. This feature is somewhat experimental.
  • adf_function: The functions used for ADF.
  • adf_type: Either double or boolean. If set to boolean, the boolean and logical operators can be used.
  • max_init_depth: JGAP parameter maxInitDepth.
  • min_init_depth: JGAP parameter minInitDepth.
  • program_creation_max_tries: JGAP parameter programCreationMaxTries.
  • population_size: JGAP parameter populationSize.
  • max_crossover_depth: JGAP parameter maxCrossoverDepth.
  • function_prob: JGAP parameter functionProb.
  • reproduction_prob: JGAP parameter reproductionProb.
  • mutation_prob: JGAP parameter mutationProb.
  • crossover_prob: JGAP parameter crossoverProb.
  • dynamize_arity_prob: JGAP parameter dynamizeArityProb.
  • no_command_gene_cloning: JGAP parameter no_command_gene_cloning.
  • use_program_cache: JGAP parameter use_program_cache.
  • new_chroms_percent: JGAP parameter newChromsPercent.
  • num_evolutions: JGAP parameter numEvolution.
  • tournament_selector_size: JGAP parameter tournamentSelectorSize.
  • max_nodes: JGAP parameter maxNodes.
  • scale_error: Sometimes the data values are very small, which gives small fitness values (i.e. errors) and makes progress hard to see. Setting this parameter multiplies the errors by the given value.
  • stop_criteria_fitness: If set (>= 0), the program runs "forever" (ignoring num_evolutions) until the fitness is less than or equal to this value.
  • show_population: Shows the whole population in each generation. Mainly for debugging purposes.
  • show_similiar: Shows all the solutions (programs) with the same fitness value as the best solution. Alternative name: show_similar.
  • similiar_sort_method: Method of sorting the similar solutions when using show_similiar. Alternative name: similar_sort_method. Valid options:
    • occurrence: descending number of occurrences (default)
    • length: ascending length of solutions
  • show_progression: Boolean. If true, the generation number is shown for all generations, even when nothing is happening (i.e. no gain in fitness).
  • sample_pct: (float) Takes a (sample) percentage of the data set if > 0.0.
  • validation_pct: Withholds a percentage of the test cases as a validation set. The fitness of this validation set is shown.
  • show_all_generations: Shows info for all generations, not just those where the fitness changed.
  • hits_criteria: Criterion for a hit: if the difference is <= this value, it is considered a hit. The number of non-hits is then used as the fitness measure instead of the sum of errors. Setting this parameter also reports the number of programs within this value.
  • mod_replace: Sets the replacement value for 0 (zero) in the ModuloIntD function.
  • showResults: Boolean. If set, all the fitness cases are shown together with the output of the fitted program and its difference from the correct values.
  • resultPrecision: The precision of the output used in showResults; default 5.
  • error_method: Error method to use. Valid options:
    • totalError: sum of (absolute) errors (default)
    • minError: minimum error
    • meanError: mean error
    • medianError: median error
    • maxError: maximum error
  • no_terminals: If true, no Terminal is used, i.e. no numbers, just variables. Default false.
  • make_time_series: Makes a time series of the first line of data. The value of num_input_variables determines the number of lags (+1 for the output variable).
  • make_time_series_with_index: As make_time_series, with an extra input variable for the index of the series. (Somewhat experimental.)
  • minNodes: value penalty: The minimum number of nodes (terminals + functions). If the number of nodes in a program is less than value, a penalty of penalty is added.
  • alldifferent_variables: true/false penalty: All the variables (terminals) should be different. If a variable occurs more than once in a program, a penalty of penalty is added (for each extra occurrence).
  • ignore_variables: (TBW) It would be nice to be able to ignore some variables in the data set, but this is yet to be written.
  • return_type: (TBW) This should be the type of the "main" return value. Note: it is currently hard-coded in the program as double/DoubleClass.

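
To make these options concrete, below is a minimal configuration sketch (with toy data invented for illustration) that exercises comment lines, a whole-number Terminal, and a user-defined test row marked with "?":

% comment lines may start with "#" or "%"
# toy problem: y = 2*x
presentation: toy problem y = 2*x
num_input_variables: 1
variable_names: x y
functions: Add,Subtract,Multiply
terminal_range: -5 5
terminal_wholenumbers: true
population_size: 100
num_evolutions: 50
data
0 0
1 2
2 4
3 ?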


Supported functions

The program supports many functions from the JGAP library. Since the "main" type is double, not all functions are applicable:


  • Multiply (double)
  • Multiply3 (double)
  • Add (double)
  • Add3 (double)
  • Add4 (double)
  • Divide (double)
  • Subtract (double)
  • Sine (double)
  • ArcSine (double)
  • Tangent (double)
  • ArcTangent (double)
  • Cosine (double)
  • ArcCosine (double)
  • Exp (double)
  • Log (double)
  • Abs (double)
  • Pow (double)
  • Round (double)
  • Ceil (double)
  • Floor (double)
  • Modulo (double): implements Java's % operator for double; see ModuloD for a variant
  • Max (double)
  • Min (double)
  • LesserThan (boolean)
  • GreaterThan (boolean)
  • If (boolean)
  • IfElse (boolean)
  • IfDyn (boolean)
  • Loop (boolean)
  • Equals (boolean)
  • ForXLoop (boolean)
  • ForLoop (boolean), cf. the double variant ForLoopD
  • Increment (boolean)
  • Pop (boolean)
  • Push (boolean)
  • And (boolean), cf. the double variant AndD
  • Or (boolean), cf. the double variant OrD
  • Xor (boolean), cf. the double variant XorD
  • Not (boolean), cf. the double variant NotD
  • SubProgram (boolean)
  • Tupel (boolean)

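
As a sketch of how the two type families are used (following the parameter descriptions above), the double-typed functions can be listed directly in "functions:", while the boolean-typed ones are only meaningful together with a boolean ADF:

# double-typed functions go directly into the main function list
functions: Add,Subtract,Multiply,Divide,Sine,Exp,Log

# boolean-typed functions (If, And, Or, ...) require adf_type boolean
adf_arity: 2
adf_type: boolean
adf_function: And,Or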

Configuration file examples

Here we link about 40 examples of configuration files for symbolic regression.


A complete example

Below is a Python example which (1) finds the analytic function, (2) constructs the solution tree as an image and shows it, and (3) plots the best-fitting function. We will use the same configuration as before (this time defined in the file "poly_test.conf"). Note that you can read this file either from a file system or from a URL location; we prefer the latter:

from java.awt import Color,Font
from jhplot  import *
from jhpro.sregression import *

js=SymRegression("http://jwork.org/dmelt/examples/data/jhpro/sregression/poly_test.conf")
js.run()

print "Best solution=",js.getBestSolution()
# optionally save the solution tree as an image and view it:
# js.createTree("/home/sergei/a.png")
# IView("/home/sergei/a.png")

data=js.getDataPND() # extract original data for plotting
print data.toString()

#val=js.getValidationPND()
#print val.toString()

c1 = HPlot("Canvas")
c1.setGTitle("Symbolic regression")
c1.visible(1)
c1.setAutoRange()
f1=F1D(js.getBestSolution(),-5,6)
f1.setTitle("best analytic solution")
c1.draw(f1)

r12=data.getP1D(0,1) # extract 0 and 1 columns
r12.setColor(Color.blue)
c1.draw(r12)

Here is the result of this example: the original data and the symbolic function that was found are plotted together.

Initializing the calculation dynamically

Instead of reading a configuration file, one can initialize the calculation dynamically inside a program or Python macro. Here is a small example where the configuration settings and the data are defined within the Python (or Java) code:

from java.awt import Color,Font
from jhplot  import *
from jhpro.sregression import *

data=PND("data") # make input data
data.add([-2.378099,26.567495])
data.add([4.153756,382.45743])
data.add([2.6789956,75.23481])
data.add([5.336802,986.33777])
data.add([2.4132318,51.379707])
data.add([-1.7993588, 9.693933])
data.add([3.9202332,307.8775])
data.add([2.9227705 ,103.56364])
data.add([-0.1422224,0.159982])
data.add([4.9111285,719.39545])

conf=[] # create configuration settings
conf.append("presentation: P(4) x^4 + x^3 + x^2 - x (the JGAP example)")
conf.append("num_input_variables: 1")
conf.append("variable_names: x y")
conf.append("functions: Add,Subtract,Multiply,Divide,Pow,Log,Sine,Sqrt,Cosine,Exp")
conf.append("terminal_range: -10 10")
conf.append("max_init_depth: 4")
conf.append("population_size: 1000")
conf.append("max_crossover_depth: 8")
conf.append("max_crossover_depth: 8")
conf.append("num_evolutions: 800")
conf.append("max_nodes: 20")
conf.append("stop_criteria_fitness: 0.1")

js=SymRegression(data,conf)  
js.run()                       # run it!
print "Best solution=",js.getBestSolution()
# optionally save the solution tree as an image and view it:
# js.createTree("/home/sergei/a.png")
# IView("/home/sergei/a.png")

# check your settings
data=js.getDataPND()
print data.toString()

conf=js.getConfigArray()
print conf.toString()

#val=js.getValidationPND()
#print val.toString()

# plot the result
c1 = HPlot("Canvas")
c1.setGTitle("Symbolic regression")
c1.visible(1)
c1.setAutoRange()
f1=F1D(js.getBestSolution(),-5,6)
f1.setTitle("best analytic solution")
c1.draw(f1)

r12=data.getP1D(0,1) # extract 0 and 1 columns
r12.setColor(Color.blue)
c1.draw(r12)

Here is another example where data are generated dynamically:

from java.awt import Color,Font
from jhplot  import *
from jhpro.sregression import *

data=PND("data")
import math
for i in range(20): # generate data using 1/2 x^2 sqrt(x) 
       x=2*i
       y=0.5*x*x*math.sqrt(x)
       data.add([x,y])

conf=[]
conf.append("presentation: 1/2 x^2 sqrt(x)")
conf.append("num_input_variables: 1")
conf.append("variable_names: x y")
conf.append("functions: Add,Subtract,Multiply,Divide,Pow,Log,Sine,Sqrt,ArcSine,Cosine,Exp")
conf.append("terminal_range: -20 20")
conf.append("max_init_depth: 4")
conf.append("population_size: 1000")
conf.append("max_crossover_depth: 8")
conf.append("max_crossover_depth: 8")
conf.append("num_evolutions: 800")
conf.append("max_nodes: 20")
conf.append("stop_criteria_fitness: 0.1")
js=SymRegression(data,conf)
js.run()


print "Best solution=",js.getBestSolution()
# optionally save the solution tree as an image and view it:
# js.createTree("/home/sergei/a.png")
# IView("/home/sergei/a.png")

data=js.getDataPND() # extract original data for plotting
print data.toString()

conf=js.getConfigArray()
print conf.toString()

#val=js.getValidationPND()
#print val.toString()


c1 = HPlot("Canvas")
c1.setGTitle("Symbolic regression")
c1.visible(1)
c1.setAutoRange()
f1=F1D(js.getBestSolution(),0,40)
f1.setTitle("best analytic solution")
c1.draw(f1)

r12=data.getP1D(0,1) # extract 0 and 1 columns
r12.setColor(Color.blue)
c1.draw(r12)

Symbolic regression in many dimensions

Simply create a new configuration file "test2.conf" and run it as before; a minimal driver script is shown after the file:

# Gelman's example of linear regression
#
# y = sqrt(x1^2+x2^2) 
presentation: Gelman's linear regression problem, solution sqrt(x^2+y^2)
return_type: DoubleClass
num_input_variables: 2
variable_names: z x y
output_variable: 0
num_rows: 40
functions: Multiply,Divide,Add,Subtract,Sqrt,Pow
terminal_range: -4 4
terminal_wholenumbers: true
max_init_depth: 4
population_size: 1000
max_crossover_depth: 8
num_evolutions: 1000
max_nodes: 21
# hits_criteria: 0.1
error_method: meanError
adf_arity: 0
adf_type: double
data
15.68 6.87 14.09
6.18 4.4 4.35
18.1 0.43 18.09
9.07 2.73 8.65
17.97 3.25 17.68
10.04 5.3 8.53
20.74 7.08 19.5
9.76 9.73 0.72
8.23 4.51 6.88
6.52 6.4 1.26
15.69 5.72 14.62
15.51 6.28 14.18
20.61 6.14 19.68
19.58 8.26 17.75
9.72 9.41 2.44
16.36 2.88 16.1
18.3 5.74 17.37
13.26 0.45 13.25
12.1 3.74 11.51
18.15 5.03 17.44
16.8 9.67 13.74
16.55 3.62 16.15
18.79 2.54 18.62
15.68 9.15 12.74
4.08 0.69 4.02
15.45 7.97 13.24
13.44 2.49 13.21
20.86 9.81 18.41
16.05 7.56 14.16
6 0.98 5.92
3.29 0.65 3.22
9.41 9 2.74
10.76 7.83 7.39
5.98 0.26 5.97
19.23 3.64 18.89
15.67 9.28 12.63
7.04 5.66 4.18
21.63 9.71 19.32
17.84 9.36 15.19
7.49 0.88 7.43
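
A minimal driver, identical to the one-variable case, reads this file and prints the best solution (assuming "test2.conf" is in the working directory):

from jhplot  import *
from jhpro.sregression import *

js=SymRegression("test2.conf")
js.run()                                     # run the evolution
print "Best solution=",js.getBestSolution()  # ideally equivalent to sqrt(x*x+y*y)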