How ‘auto-troubleshooting’ affects your availability? Here is the result, and step by step expression.
If you have too many application servers on your integration environment and you are the first responsible of your applications availability, I think you have to solve the problems with automations.
Firstly you should develop frontend test automation solutions and backend test automation solutions that watch the applications endlessly and record every step on logs. While they are working I have developed script that watches the logs of automation scripts.
Furthermore , I prepared some scripts that solve the common problems of application servers and some middleware applications. In addition to this , i developed a decisive script that categorizes the errors by error code. When the watcher script detected an error,it called the decisive script and the decisive script selected and called the solution script in the solution script library with the error code.
Before executing the solution scripts, a script removed the faulty server behind from the load balancer. If the execution is successful,problem is solved, and the script is sure about that, added the node again behind the load-balancer. Nobody recognized.
Of course sometimes it failed but here is the affect of this project on availability of our integration environment. I am very blissful now.
From %72 to %91. Great.