Thursday 11 December 2014

Postgres-XC on Lubuntu 14.04 LTS

Although NoSQL data storage is a new trend for storing unstructured data, relational data storage can, in my view, still be a good choice for such data; it depends on how one configures and uses the relational tables. However, a single-node relational data store is unable to scale with the increasing amount of data. If your organization needs to keep historical structured data in a "centralized" data store, then one option is to migrate to a scalable solution. One possible choice is Postgres-XC.

Postgres-XC is an open source database cluster solution that provides:
  • Write-scalability
    Write-scalable means Postgres-XC can be configured with as many database servers as you want and can handle many more writes (updating SQL statements) than a single database server could.
  • Synchronous update
    Synchronous means any database update from any database server is immediately visible to any other transactions running on different masters.
  • Symmetrical update
    Symmetric means you can have more than one database server providing a single database view.
  • Transparency
    Transparent means you don't have to worry about how your data is stored internally across more than one database server.
It is a collection of tightly coupled database components which can be installed on more than one piece of hardware or virtual machine. They store your data in a distributed way, that is, in a partitioned or replicated way, at your choice for each table. Postgres-XC determines where the target data is stored and issues the corresponding queries to the servers holding that data. Each database server provides a uniform data view to your applications. Any database update from any server is immediately visible to applications connecting to the database from other servers. This feature is called "synchronous multi-master" capability, and it is the most significant feature of Postgres-XC. Details about Postgres-XC can be found at the documentation site here.
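For example, whether a table is partitioned or replicated is chosen per table at creation time through Postgres-XC's distribute by clause. A minimal sketch (the table and column names are hypothetical):
    create table sensor_log (id bigint, reading float8)
        distribute by hash(id);        -- rows partitioned across Datanodes
    create table country (code char(2), name text)
        distribute by replication;     -- a full copy kept on every Datanode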

Single Node Configuration
In this case study, Lubuntu 14.04 was used. You can of course install Postgres-XC from pre-built packages for operating systems other than Linux as well. Here, I have simply installed the Postgres-XC package for Ubuntu-based OSes. The default installation will create a GTM, a Coordinator, and 2 Datanodes. I would like to improve on this setup by including a GTM-Proxy. The following first provides the steps for creating a single-node cluster. Once a single-node cluster is up and running, you can simply adapt the configuration to the other nodes in the cluster.
  1. Install from repository
    Use the following command to install from the repository.
    sudo apt-get install postgres-xc
    The version of Postgres-XC as of the time of writing this article is 1.1 (Software Packages in "trusty", Subsection database). The default installation comes pre-configured with 1 GTM, 1 Coordinator, and 2 Datanodes. Data folders for these components are initialized, and configuration files are placed in different subfolders of /etc/postgres-xc/.

    After the installation completes, you can issue the following command to check the locations of the Postgres-XC files installed by the package.
    dpkg -L postgres-xc
    Note where the startup script and configuration files are located in the default installation. Also take note of the sample configuration files provided; you may need them if you mess up and have not made a backup of the original files in the /etc/postgres-xc/ subfolders. The startup script of Postgres-XC is located at /etc/init.d/postgres-xc.
    ...
    /usr/share/postgres-xc/gtm_proxy.conf.sample
    ...
    /etc/init.d/postgres-xc
    /etc/postgres-xc/datanode
    /etc/postgres-xc/gtm
    /etc/postgres-xc/coordinator
    ...
  2. Initialize GTM-Proxy data folder
    GTM-Proxy is not a mandatory component of a Postgres-XC cluster, but it can be used to group messages between the GTM and the cluster nodes, thus reducing workload and the number of packets exchanged through the network. This step will put a GTM-Proxy in between the GTM and the other components (Coordinator and Datanodes).

    The following will create a data folder for GTM-Proxy. Take note to execute initgtm as the postgres-xc user, which is the default user created by the package to run the original set of Postgres-XC components.
    su postgres-xc
    initgtm -Z gtm_proxy -D /var/lib/postgres-xc/GTM_PROXY
    The configuration file for GTM-Proxy, gtm_proxy.conf, is created by the initgtm command under the GTM-Proxy data folder /var/lib/postgres-xc/GTM_PROXY. It is up to you whether to standardize on the way the default installation places the configuration files for the other components. I decided to copy gtm_proxy.conf to /etc/postgres-xc/gtm_proxy/GTM_PROXY/gtm_proxy.conf, and then create a symbolic link to it in the GTM-Proxy data folder /var/lib/postgres-xc/GTM_PROXY, as shown in the sketch below. The data folders for the GTM, Coordinator, and 2 Datanodes are located in /var/lib/postgres-xc/, and the configuration files for those components are located in different subfolders of /etc/postgres-xc/: /etc/postgres-xc/coordinator/CN for the Coordinator, /etc/postgres-xc/datanode/DN1 for Datanode 1, /etc/postgres-xc/datanode/DN2 for Datanode 2, and /etc/postgres-xc/gtm/GTM for the GTM. You will see that the startup script /etc/init.d/postgres-xc refers to these folders. Thus, it is easier to modify the startup script when the GTM-Proxy data folder is placed under /var/lib/postgres-xc/GTM_PROXY and the GTM-Proxy configuration file under /etc/postgres-xc/gtm_proxy/GTM_PROXY.
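    A minimal sketch of that relocation (run with root privileges; the gtm_proxy/GTM_PROXY folder names are my own convention, and mv plus a symbolic link implements the copy-and-link described above):
    sudo mkdir -p /etc/postgres-xc/gtm_proxy/GTM_PROXY
    sudo mv /var/lib/postgres-xc/GTM_PROXY/gtm_proxy.conf /etc/postgres-xc/gtm_proxy/GTM_PROXY/
    # leave a symbolic link behind in the data folder
    sudo ln -s /etc/postgres-xc/gtm_proxy/GTM_PROXY/gtm_proxy.conf /var/lib/postgres-xc/GTM_PROXY/gtm_proxy.conf
    sudo chown -R postgres-xc:postgres-xc /etc/postgres-xc/gtm_proxy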
  3. Update GTM and GTM-Proxy configuration files
    • /var/lib/postgres-xc/GTM/gtm.conf
      The default GTM listens on port 6666, and the Coordinator and Datanodes connect to this port. To make fewer changes to the configuration files of the Coordinator and Datanodes, make GTM-Proxy listen on port 6666 instead. Here, the GTM is changed to listen on port 6668, as suggested in gtm_proxy.conf (see the sample settings at the end of this step).
      Ensure the nodename of the GTM is unique if you intend to create a standby GTM in your cluster too.
      Ensure the GTM listens on the '*' address instead of 'localhost'. Since a GTM-Proxy is used here, this allows other GTM-Proxies or standby GTM nodes in the cluster to connect to this active GTM.
      You will notice that to configure a standby GTM, you simply need to change the file gtm.conf.
    • /var/lib/postgres-xc/GTM_PROXY/gtm_proxy.conf
      Change the listening port of GTM-Proxy to 6666; the Coordinator and Datanodes will connect to the GTM via GTM-Proxy through this port. Point the GTM-Proxy to the GTM by changing the two parameters gtm_host and gtm_port in the GTM-Proxy configuration file.
      Change the nodename of the GTM-Proxy to a unique name as well.
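    After these changes, the relevant settings look roughly like this (a sketch; the nodenames are my own choices, while the parameter names are the standard gtm.conf and gtm_proxy.conf parameters):
    # /var/lib/postgres-xc/GTM/gtm.conf
    nodename = 'gtm'
    listen_addresses = '*'
    port = 6668

    # /var/lib/postgres-xc/GTM_PROXY/gtm_proxy.conf
    nodename = 'gtm_proxy1'
    listen_addresses = '*'
    port = 6666
    gtm_host = 'localhost'
    gtm_port = 6668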
  4. Modify startup script /etc/init.d/postgres-xc
    The following is the modified startup script. You can of course remove this script if you intend to start up the components manually as the postgres-xc user (a sketch of the manual commands follows the notes at the end of this step). Just remember to start the components in the right order. If you have multiple nodes on different physical computing nodes, do start up the GTM node first; Coordinators should be the last to start up.
    For shutdown, Coordinators should be the first to shut down, and the GTM should be the last. If you shut down the GTM before the Coordinators, you will encounter problems shutting down the Coordinators; I had to use pg_ctl kill QUIT <process_id_of_coordinator> to terminate them. To forcefully terminate GTM or GTM-Proxy you can use the kill command, though it is recommended to use the gtm_ctl command for GTM and GTM-Proxy.
    #!/bin/sh
    set -e
    #set -x -v

    ### BEGIN INIT INFO
    # Provides: postgres-xc
    # Required-Start: $local_fs $remote_fs $network $time
    # Required-Stop: $local_fs $remote_fs $network $time
    # Should-Start: $syslog
    # Should-Stop: $syslog
    # Default-Start: 2 3 4 5
    # Default-Stop: 0 1 6
    # Short-Description: Postgres-XC RDBMS server
    ### END INIT INFO

    PGXC_DATA=/var/lib/postgres-xc
    #PGXC_LOG=/var/log/postgres-xc
    PGXC_RUN=/var/run/postgresql

    . /lib/lsb/init-functions

    if [ -r /etc/default/postgres-xc ]; then
        . /etc/default/postgres-xc
    fi

    # create socket directory
    [ -d $PGXC_RUN ] || mkdir -p $PGXC_RUN

    chown postgres-xc.postgres-xc $PGXC_RUN
    chmod 2775 $PGXC_RUN

    check_if_running()
    {
        status=0
        case "$1" in
            gtm)
                pidfile=gtm.pid
                sleep 2
                ;;
            gtm_proxy) # include gtm_proxy
                pidfile=gtm_proxy.pid
                sleep 2
                ;;
            postgres)
                pidfile=postmaster.pid
                confport=$(grep -si '^port *=' $PGXC_DATA/$2/postgresql.conf |
                    cut -f2 -d= | sed -e 's/#.*$//' -e 's/ //g')
                port=${confport:=5432}

                # wait for server to come up, but no more than 10 seconds
                i=1
                [ -z "$3" ] &&
                while [ ! -S $PGXC_RUN/.s.PGSQL.$port ]; do
                    i=$((i+1))
                    sleep 1
                    [ $i -gt 10 ] && break
                done
                ;;
        esac

        # check if it is really running
        if [ -r $PGXC_DATA/$2/$pidfile ]; then
            PID=`head -n 1 $PGXC_DATA/$2/$pidfile` || true
            if [ -z "$PID" ]; then
                status=1
            else
                [ "`ps h -o comm -p $PID`" != $1 ] && status=1
            fi
        else
            status=1
        fi
        return 0
    }

    do_ctl()
    {
        ACTION=$1
        STATUS=0
        status=0
        [ "$ACTION" = "stop" ] && {
            $FAST_STOP &&
            STOP_MODE="-m smart" # modified to smart stop
        }
        for i in 1 2 3 4; do # modified
            # modified the start-up sequence
            if [ "$ACTION" = "start" ] || [ "$ACTION" = "status" ]; then
                case $i in
                    1) TYPE=gtm ;;
                    2) TYPE=gtm_proxy ;;
                    3) TYPE=datanode ;;
                    4) TYPE=coordinator ;;
                esac
            elif [ "$ACTION" = "stop" ]; then
                # modified the stop sequence (reverse of the start-up sequence)
                case $i in
                    4) TYPE=gtm ;;
                    3) TYPE=gtm_proxy ;;
                    2) TYPE=datanode ;;
                    1) TYPE=coordinator ;;
                esac
            fi

            case $TYPE in
                gtm)
                    PG_CTL=gtm_ctl
                    PG_START=gtm
                    ;;
                gtm_proxy)
                    PG_CTL=gtm_ctl
                    PG_START=gtm_proxy
                    ;;
                coordinator|datanode)
                    PG_CTL=pg_ctl
                    PG_START=postgres
                    ;;
            esac
            log_daemon_msg "$2 Postgres-XC ${TYPE}"
            for NODE in $(ls /etc/postgres-xc/$TYPE/); do
                [ -f /etc/postgres-xc/$TYPE/$NODE/run ] && {
                    # if [ "$ACTION" = "reload" ] && [ "$TYPE" = "gtm" ]; then
                    #     ACTION="restart"
                    # fi
                    [ "$ACTION" = "stop" ] &&
                    check_if_running $PG_START $NODE stop
                    if [ $status -eq 1 ]; then
                        log_progress_msg "$NODE is not running"
                    else
                        # removed the -l option, as starting gtm_proxy with it
                        # has problems; log options are specified in the
                        # configuration files instead
                        start-stop-daemon -c postgres-xc \
                            -Sx /usr/bin/$PG_CTL -- $ACTION $STOP_MODE \
                            -D $PGXC_DATA/$NODE -Z $TYPE

                        # ERRMSG=$(start-stop-daemon -c postgres-xc \
                        #     -Sx /usr/bin/$PG_CTL -- $ACTION $STOP_MODE \
                        #     -D $PGXC_DATA/$NODE -Z $TYPE \
                        #     -l $PGXC_LOG/datanode.log
                        # STATUS=$?
                        # if $FAST_STOP && [ $STATUS -gt 0 ] &&
                        #     [ "$ACTION" = "stop" ]; then
                        #     STATUS=0
                        #     start-stop-daemon -c postgres-xc \
                        #         -Sx /usr/bin/$PG_CTL -- stop -mi \
                        #         -D $PGXC_DATA/$NODE -Z $TYPE \
                        #         -l $PGXC_LOG/datanode.log || STATUS=$?
                        # fi)

                        REPORT=$NODE
                        [ "$ACTION" = "status" ] && REPORT=$ERRMSG
                        log_progress_msg "$REPORT"
                        if [ $STATUS -eq 0 ] && [ "$ACTION" = "start" ]; then
                            check_if_running $PG_START $NODE
                            STATUS=$status
                        fi
                    fi
                }
            done
            log_end_msg $STATUS
        done
    }


    case "$1" in
        start)
            do_ctl start Starting
            ;;
        # restart)
        #     do_ctl restart Restarting
        #     ;;
        status)
            do_ctl status Status
            ;;
        stop)
            do_ctl stop Stopping
            ;;
        # reload)
        #     do_ctl reload Reloading
        #     ;;
        # force-reload)
        #     $0 restart
        #     ;;
        *)
            # echo "Usage: $0 {start|stop|restart|reload|force-reload|status}"
            echo "Usage: $0 {start|stop|status}"
            exit 1
            ;;
    esac

    exit 0

    Note:
    • The original script /etc/init.d/postgres-xc starts up the components in the incorrect order. The right startup sequence is GTM, Datanodes, then Coordinators. Here, I have included a GTM-Proxy, so the startup order is GTM, GTM-Proxy, Datanodes, then Coordinators. The shutdown sequence should be the reverse.
    • As of writing, I had problems starting up GTM-Proxy with the -l option. Thus, the startup script has been modified to start all components without the -l option.
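    If you start the components manually instead, a sketch of the ordering using the Postgres-XC control tools (run as the postgres-xc user; this assumes the data folders are named GTM, GTM_PROXY, CN, DN1, and DN2 following the configuration layout described in step 2):
    # start up: GTM, GTM-Proxy, Datanodes, then Coordinator
    gtm_ctl start -Z gtm -D /var/lib/postgres-xc/GTM
    gtm_ctl start -Z gtm_proxy -D /var/lib/postgres-xc/GTM_PROXY
    pg_ctl start -Z datanode -D /var/lib/postgres-xc/DN1
    pg_ctl start -Z datanode -D /var/lib/postgres-xc/DN2
    pg_ctl start -Z coordinator -D /var/lib/postgres-xc/CN
    # shut down in the reverse order
    pg_ctl stop -D /var/lib/postgres-xc/CN -m smart
    pg_ctl stop -D /var/lib/postgres-xc/DN2 -m smart
    pg_ctl stop -D /var/lib/postgres-xc/DN1 -m smart
    gtm_ctl stop -Z gtm_proxy -D /var/lib/postgres-xc/GTM_PROXY
    gtm_ctl stop -Z gtm -D /var/lib/postgres-xc/GTM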

Note:
  • Ensure every node in the cluster has a unique name. The name of every Coordinator and Datanode should correspond to the records in the pg_catalog.pgxc_node table in the postgres database.
  • Install Samba so that cluster nodes can be referred to via NetBIOS names instead of actual IP addresses.
  • Update pg_hba.conf for every Coordinator and Datanode (a sample entry follows this list).
  • Ensure each Coordinator knows where to locate the Datanodes and the other Coordinators. You can start a Coordinator in restoremode using
    pg_ctl -Z restoremode -D /var/lib/postgres-xc/CN/ start
    and then update the cluster node information. The following commands connect to the Coordinator at port 5432 and the postgres database, and first display all the node information.
    psql -h <IP of coordinator> -U postgres-xc -d postgres
    select * from pg_catalog.pgxc_node order by node_name;
    It is recommended to remove all nodes and re-insert the records using the create node SQL command, because create node automatically creates the node_id. The node information must be correct and in sync across all Coordinators; otherwise, you will get data consistency errors.
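A sample pg_hba.conf entry, assuming the cluster sits on a trusted subnet such as the 192.168.0.0/24 network used below (trust is shown only for simplicity; choose an authentication method appropriate for your environment):
    # TYPE  DATABASE  USER  ADDRESS          METHOD
    host    all       all   192.168.0.0/24   trust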

Multiple Nodes Configuration
This multiple-node configuration is done simply by cloning and adapting the node configuration files.

Here, I have 4 computing nodes in the cluster, with 1 GTM-Proxy, 1 Coordinator, and 2 Datanodes on each PC. I configured PC1 to run the active GTM; the remaining PCs host standby GTMs.
  1. First, update the configuration files of each component on all PCs.
  2. Start the GTM as normal and then the GTM-Proxies. Then start the Coordinators in restoremode and update the node information of the Coordinators as follows; here I have 4 PCs. It is recommended to delete all node information and then recreate it, which ensures the information is in sync across all the Coordinators. Once done, stop the Coordinators running in restoremode (a sketch of the full sequence on one PC follows this list).
    psql -h <IP of coordinator> -U postgres-xc -d postgres
    delete from pg_catalog.pgxc_node;
    create node "PC1-cn" with (type='coordinator', host='192.168.0.10', port=5432);
    create node "PC1-dn1" with (type='datanode', host='192.168.0.10', port=15432);
    create node "PC1-dn2" with (type='datanode', host='192.168.0.10', port=15433);
    create node "PC2-cn" with (type='coordinator', host='192.168.0.11', port=5432);
    create node "PC2-dn1" with (type='datanode', host='192.168.0.11', port=15432);
    create node "PC2-dn2" with (type='datanode', host='192.168.0.11', port=15433);
    create node "PC3-cn" with (type='coordinator', host='192.168.0.12', port=5432);
    create node "PC3-dn1" with (type='datanode', host='192.168.0.12', port=15432);
    create node "PC3-dn2" with (type='datanode', host='192.168.0.12', port=15433);
    create node "PC4-cn" with (type='coordinator', host='192.168.0.13', port=5432);
    create node "PC4-dn1" with (type='datanode', host='192.168.0.13', port=15432);
    create node "PC4-dn2" with (type='datanode', host='192.168.0.13', port=15433);
  3. Start the Datanodes in normal mode.
  4. Start the Coordinators in normal mode.
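For steps 2 to 4, the commands on each PC look roughly as follows (a sketch; run as the postgres-xc user, with the GTM and GTM-Proxy already up, and with data folder paths following the single-node layout above):
    # start the Coordinator in restoremode and fix the node information
    pg_ctl -Z restoremode -D /var/lib/postgres-xc/CN/ start
    psql -h <IP of coordinator> -U postgres-xc -d postgres   # run the SQL above
    pg_ctl stop -D /var/lib/postgres-xc/CN/
    # then start the Datanodes, and the Coordinator last
    pg_ctl start -Z datanode -D /var/lib/postgres-xc/DN1
    pg_ctl start -Z datanode -D /var/lib/postgres-xc/DN2
    pg_ctl start -Z coordinator -D /var/lib/postgres-xc/CN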
That is all. Have fun.

Tuesday 6 May 2014

Interfacing Python with C/C++ (Passing Python List into C function)

The following example illustrates how to pass a Python list as an argument into a C/C++ function and subsequently update the elements in it. It assumes that the list contains floating-point numbers. The example is based on the Python/C API Reference Manual, version 2.7.6.

Compile the following file example.cpp into a dynamic link library. Here, I have used NetBeans 8.0 running on Ubuntu 14.04 to create a C++ dynamic link library project. The output is a file example.so.
//file: example.cpp

#include <Python.h>
#include <iostream>

extern "C" {
  PyObject* doo(PyObject* arg) {
    //get the number of elements in the Python list
    long cnt = PyList_Size(arg);
    std::cout<<"Number of items: "<<cnt<<std::endl;

    for (long i=0; i<cnt; i++) {
      //print the element value
      std::cout<<PyFloat_AsDouble(PyList_GET_ITEM(arg,i))<<std::endl;

      //replace the element in the Python list
      //(PyList_SetItem steals the reference to the new value)
      PyList_SetItem(arg,i,PyFloat_FromDouble(i));
      std::cout<<PyFloat_AsDouble(PyList_GET_ITEM(arg,i))<<std::endl;
    }
    //return None (its reference count must be incremented first)
    Py_INCREF(Py_None);
    return Py_None;
  }
}
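
If you prefer building from the command line instead of NetBeans, a command along these lines should work (assuming the Python 2.7 development headers are installed, e.g. via the python-dev package):
g++ -shared -fPIC -I/usr/include/python2.7 example.cpp -o example.so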

The following Python code loads the library and calls the function doo. Do ensure the Python interpreter is version 2.7.6.
import ctypes

#load example.so (update path to library file accordingly);
#PyDLL is used instead of CDLL so that the GIL stays held while
#the C function manipulates Python objects
dll = ctypes.PyDLL('example.so')
#set the return data type of function doo to Python object
dll.doo.restype = ctypes.py_object
#set the argument data type of function doo to Python object,
#one element per argument in a list
dll.doo.argtypes = [ctypes.py_object]
l = [1.11, 2.22, 3.33]
print dll.doo(l)

The following is the output of the Python code.
Number of items: 3
1.11
0
2.22
1
3.33
2
None

To verify that the list has indeed been modified, print the list in the Python console.
>>> l
[0.0, 1.0, 2.0]

Thursday 17 April 2014

Stochastic Learning of Handwritten Digits

The video shows the results during the stochastic pre-training phase of an artificial neural network on a set of 60,000 handwritten digits, which I tested using a simplistic learning mechanism modeled on its biological counterpart. The results using this learning method seem promising.