Introduction to Symlearn
Main Concepts
Symbolic regression is concerned with identifying the mathematical expression that maps a set of independent inputs to the output variables of a particular system. The process searches the space of candidate functional expressions and combines them to form a mathematical model. This contrasts with other regression methods, which start from a predefined model form and only optimize its parameters.
Symbolic regression methods combine different mathematical expressions to form the structure of the solution. These expressions include simple arithmetic operators (addition, subtraction, multiplication, and division), trigonometric functions (sine, cosine, tangent, etc.), logical operators (and, or, not, nor, if-else, etc.), and others. Besides the functional expressions, the method also uses variables and constants as terminals. A terminal represents either a coefficient (when it is a constant) or one of the problem features (when it is a variable), while the functional expressions describe the relationship between the functional nodes and terminals connected to them. Together, the functional and terminal sets make up the search space, which is defined by the analyst, as the following figure shows.
Functional and terminal sets (searching space set)
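As a rough illustration of how an analyst might write down such a search space, the sketch below declares a small functional set and terminal set in plain Python. Every name here (`FUNCTIONS`, `TERMINALS`, `protected_div`, `arity`) is hypothetical and is not taken from symlearn's API. The protected division is an assumption borrowed from common genetic programming practice, not something this page specifies.

```python
import math
import operator

def protected_div(a, b):
    # Assumed convention: return 1.0 instead of raising on division by zero.
    return a / b if b != 0 else 1.0

# Hypothetical functional set: symbol -> (arity, callable).
FUNCTIONS = {
    "+": (2, operator.add),
    "-": (2, operator.sub),
    "*": (2, operator.mul),
    "/": (2, protected_div),
    "sin": (1, math.sin),
    "cos": (1, math.cos),
}

# Hypothetical terminal set: problem features (variables) plus constants.
VARIABLES = ["x", "y"]
CONSTANTS = [1.0, 3.14, 7.6]
TERMINALS = VARIABLES + CONSTANTS

def arity(symbol):
    """Terminals have arity 0; functions report their declared arity."""
    return FUNCTIONS[symbol][0] if symbol in FUNCTIONS else 0
```

Keeping the arity next to each function makes it straightforward to build only well-formed trees from these two sets.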
Individual Representation
In symlearn, each individual is a program that represents a candidate solution to the problem. symlearn represents individuals with a hierarchical parse-tree structure of connected functional expressions and terminals, as John Koza proposed in his original Genetic Programming (GP) model and as the following figure shows.
Individual representation in Symlearn
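To make the parse-tree idea concrete, here is a minimal sketch of a tree node that can evaluate itself. The `Node` class and the `OPS` table are hypothetical and do not reflect symlearn's internal classes. The example tree follows the prefix expression given in this section, reading its second branch as a subtraction.

```python
import operator

# Assumed functional set for this sketch only.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

class Node:
    """One node of a parse tree: a function with children, or a terminal."""

    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

    def evaluate(self, env):
        if not self.children:
            # Terminal: a variable name is looked up, a constant is returned as is.
            return env[self.value] if isinstance(self.value, str) else self.value
        # Functional node: evaluate the children first, then apply the function.
        args = [child.evaluate(env) for child in self.children]
        return OPS[self.value](*args)

# Tree for: (x * 3.14) + (7.6 - y)
tree = Node("+", [
    Node("*", [Node("x"), Node(3.14)]),
    Node("-", [Node(7.6), Node("y")]),
])
result = tree.evaluate({"x": 2.0, "y": 1.6})
```

Evaluation is a post-order walk: each functional node combines the values of the subtrees below it.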
A parse-tree structure is easily implemented as a computer program using LISP syntax, also known as a symbolic expression or S-expression. For instance, the program visualized in the previous figure is written as the S-expression:
\[(+\ (*\ x\ 3.14)\ (-\ 7.6\ y))\]
which is equivalent to:
\[(x \times 3.14) + (7.6 - y)\]
Initialization
The optimization process starts by randomly generating an initial population of tree programs with the ramped half and half method, which combines two sub-methods: full and grow. Mixing the two increases diversity and avoids similarity among the generated programs. In symlearn, the generate_population method initializes the first generation.
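The sketch below illustrates the general ramped half and half idea and is not symlearn's generate_population implementation. In the usual formulation, full grows every branch to the depth limit, while grow may stop a branch early by picking a terminal. Half of the population uses each method, and the depth limit is ramped over a range. The function names, the nested-tuple tree encoding, and the symbol sets are all assumptions made for illustration.

```python
import random

FUNCTIONS = {"+": 2, "-": 2, "*": 2, "sin": 1}   # assumed symbols -> arity
TERMINALS = ["x", "y", 1.0, 3.14]                 # assumed variables and constants

def build_tree(max_depth, method, rng):
    """Return a nested-tuple tree; `method` is "full" or "grow"."""
    if max_depth == 0:
        return rng.choice(TERMINALS)
    # full only picks functions above the depth limit; grow may pick a terminal.
    pool = list(FUNCTIONS) if method == "full" else list(FUNCTIONS) + TERMINALS
    symbol = rng.choice(pool)
    if symbol not in FUNCTIONS:
        return symbol
    return (symbol,) + tuple(build_tree(max_depth - 1, method, rng)
                             for _ in range(FUNCTIONS[symbol]))

def ramped_half_and_half(pop_size, min_depth, max_depth, seed=None):
    rng = random.Random(seed)
    depths = list(range(min_depth, max_depth + 1))
    population = []
    for i in range(pop_size):
        method = "full" if i % 2 == 0 else "grow"      # half full, half grow
        depth_limit = depths[(i // 2) % len(depths)]   # ramp over the depth range
        population.append(build_tree(depth_limit, method, rng))
    return population

def depth(tree):
    """Depth of a nested-tuple tree; a lone terminal has depth 0."""
    return 0 if not isinstance(tree, tuple) else 1 + max(depth(c) for c in tree[1:])
```

For example, `ramped_half_and_half(12, 2, 4, seed=0)` yields twelve trees whose depths never exceed 4, with the full trees reaching their assigned depth exactly.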
Operations
Sharing
In the sharing operation, a subtree instance is taken from the brighter tree and glued to an instance of the less bright tree. The subtree instance is selected randomly, which increases the diversity of the solution trees. The following figure shows an example of the sharing operation.
Sharing operation
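One way to picture the sharing step is the sketch below, which copies a randomly chosen subtree of the brighter tree over a randomly chosen node of the less bright tree. Trees are encoded as nested tuples, and all helper names (`paths`, `get_subtree`, `replace_subtree`, `share`) are hypothetical. How symlearn actually picks the target position is not stated here, so the uniform random choice is an assumption.

```python
import random

def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) to every node of a nested-tuple tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replace_subtree(tree, path, new_subtree):
    """Return a copy of `tree` with the node at `path` replaced by `new_subtree`."""
    if not path:
        return new_subtree
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new_subtree),) + tree[i + 1:]

def share(brighter, dimmer, rng=random):
    """Glue a random subtree of `brighter` onto a copy of `dimmer`."""
    donor_path = rng.choice(list(paths(brighter)))    # assumed: uniform choice
    target_path = rng.choice(list(paths(dimmer)))     # assumed: uniform choice
    return replace_subtree(dimmer, target_path, get_subtree(brighter, donor_path))

brighter = ("+", ("*", "x", 3.14), ("-", 7.6, "y"))
dimmer = ("sin", ("+", "x", 1.0))
child = share(brighter, dimmer, random.Random(1))
```

Because tuples are immutable, both input trees are left untouched and the result is a new tree.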
Simplification
The simplification operation merges the nodes of a subtree into a single node with an equivalent value. It is applied only to branches that do not contain any variable nodes. The following figure shows an example of the simplification operation, where the expression 1 + 1 is replaced with a single node whose value is 2.
Simplification operation
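The rule described above amounts to constant folding, which can be sketched as follows. The nested-tuple encoding, the `OPS` table, and the `simplify` function are assumptions for illustration rather than symlearn's own code.

```python
import math
import operator

# Assumed functional set for this sketch only.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "sin": math.sin, "cos": math.cos}

def simplify(tree):
    """Fold every variable-free branch of a nested-tuple tree into one constant."""
    if not isinstance(tree, tuple):
        return tree                      # terminal: variable name or constant
    symbol, *children = tree
    children = [simplify(c) for c in children]
    if all(isinstance(c, (int, float)) for c in children):
        return OPS[symbol](*children)    # no variables below: replace with the value
    return (symbol, *children)           # a variable remains: keep the structure

folded = simplify(("*", "x", ("+", 1, 1)))   # ("*", "x", 2)
```

Branches are folded bottom-up, so a constant subtree nested inside a branch that contains a variable is still merged, while the variable branch itself is preserved.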
Substitution
Unlike the sharing operation described previously, the substitution operation happens within a single tree. In this operation, a random node is chosen and replaced with a suitable one. To do this without changing the structure of the node's subtree, nodes are classified according to their arity: 0-arity nodes are the terminal nodes, 1-arity nodes include the trigonometric and logarithmic function nodes, 2-arity nodes include the addition, subtraction, multiplication, division, and power function nodes, and so on. In the substitution operation, the chosen node is replaced with a new node picked at random from the same class.
Substitution operation
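A minimal sketch of arity-preserving substitution is shown below, again on nested-tuple trees. The symbol sets and the helper names (`same_class`, `node_paths`, `swap_symbol`, `substitute`) are hypothetical. The sketch also assumes every arity class has at least two members, so a different replacement symbol always exists.

```python
import random

FUNCTIONS = {"+": 2, "-": 2, "*": 2, "/": 2, "sin": 1, "cos": 1, "log": 1}  # assumed
TERMINALS = ["x", "y", 1.0, 3.14]                                            # assumed

def arity(symbol):
    return FUNCTIONS.get(symbol, 0)

def same_class(symbol):
    """All other symbols that share the arity of `symbol`."""
    k = arity(symbol)
    pool = TERMINALS if k == 0 else [s for s, a in FUNCTIONS.items() if a == k]
    return [s for s in pool if s != symbol]

def node_paths(tree, prefix=()):
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from node_paths(child, prefix + (i,))

def swap_symbol(tree, path, rng):
    """Replace only the symbol at `path`; the children below it are kept."""
    if path:
        i = path[0]
        return tree[:i] + (swap_symbol(tree[i], path[1:], rng),) + tree[i + 1:]
    if isinstance(tree, tuple):
        return (rng.choice(same_class(tree[0])),) + tree[1:]
    return rng.choice(same_class(tree))

def substitute(tree, rng=random):
    path = rng.choice(list(node_paths(tree)))
    return swap_symbol(tree, path, rng)

tree = ("+", ("*", "x", 3.14), ("sin", "y"))
mutated = substitute(tree, random.Random(0))
```

Since only the symbol of one node changes and its arity class is preserved, the mutated tree has exactly the same shape as the original.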