My GSoC experience with CERN-HSF, Summer 2018

The Google Summer of Code (GSoC) program is one of the most prestigious programs for student developers eager to work and demonstrate their skills in a full-fledged working environment. The program gives this opportunity to students all around the world, and I am glad to have been selected for it with CERN-HSF, one of the pioneering organizations for nuclear and particle-physics research.

My experience started before the GSoC program itself, with my first contact with my mentor, Federico Stagni, with whom I discussed the proposed project and demonstrated my suitability for it by completing the tasks available here. Based on that work and my timeline, I submitted my proposal, which went through multiple revisions with input from Federico along with my own improvements. Finally, after almost a month of waiting, the students selected for GSoC 2018 were announced, and I was fortunate to be selected with CERN-HSF for the following project: Monitoring and traceability of jobs using ElasticSearch - DIRAC.

Moving on to the coding period after selection: the program was a rigorous three-month schedule of weekly tasks based on the submitted proposal. It would be difficult to describe the work week by week in a single blog post, so below is a condensed view of my work during the summer. If anyone is interested in following my weekly progress, they can read my regular posts about GSoC work here.

The work during this summer was divided between the following milestones:

  • Add ElasticSearch (ES) backend for Job Monitoring

    • The Job Monitoring module in the Workload Management System (WMS) used a MySQL backend to store and access Job Parameters. The caveat of using MySQL is that it is a relational database, and the relationships between keys limit the queries that can be processed.
    • Keeping the above in mind, ElasticSearch (ES), a NoSQL (non-relational) database, seemed a good choice. I therefore set up ES indices and wrote the functions setJobParameters and getJobParameters, which write values to and read values from those indices, as the names set and get suggest.
    • The code implementation regarding the above work can be found in Pull Request #3708.
    • The above PR was merged into the integration branch of the DIRACGrid project.
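To illustrate the idea, here is a minimal sketch of how job parameters can be stored as documents keyed by JobID. The class and method bodies are my own illustration, not DIRAC's actual implementation, and a plain dict stands in for the ES index so the example is self-contained; in the real code the same calls go through the ElasticSearch client.

```python
# Hypothetical sketch of an ES-style job-parameter store. A dict stands in
# for the ElasticSearch index; the method names follow the post.

class ElasticJobParametersSketch:
    def __init__(self):
        self.index = {}  # stand-in for an ES index: JobID -> document

    def setJobParameters(self, jobID, parameters):
        """Merge (name, value) pairs into the job's document."""
        doc = self.index.setdefault(jobID, {"JobID": jobID})
        doc.update(parameters)

    def getJobParameters(self, jobID, paramList=None):
        """Return all parameters for a job, or just the requested subset."""
        doc = self.index.get(jobID, {})
        if paramList is None:
            return {k: v for k, v in doc.items() if k != "JobID"}
        return {k: doc[k] for k in paramList if k in doc}

db = ElasticJobParametersSketch()
db.setJobParameters(101, {"Status": "Running", "Site": "LCG.CERN.ch"})
print(db.getJobParameters(101, ["Status"]))  # {'Status': 'Running'}
```

Because ES documents are schemaless, a new parameter name simply becomes a new field, which is exactly what makes the non-relational backend attractive here.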

  • Add Job Attributes to ElasticSearch backend

    • The Jobs table contains a set of values that are accessed most frequently by the queries being processed. It is therefore worthwhile to move these values to the ES backend, which makes query processing more efficient and opens up new kinds of queries.
    • As discussed in the previous point, the attributes moved to ES are:
      • Job Group, Owner, Proxy, Submission Time, Running Time
    • To set and retrieve these attributes, the function setJobParameters was modified to accept them as kwargs (keyword arguments) and store the values whenever the user specifies them. A new function named getJobParametersAndAttributes was introduced in ElasticJobDB.py; given a JobID, it returns both the parameters and the attributes mentioned above.
    • The commits related to this work were merged into the integration branch as part of PR 3744.
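The kwargs pattern described above can be sketched as follows. This is an illustration under my own assumptions, not the actual ElasticJobDB.py code: the attribute names are written without spaces as identifiers, and a dict again stands in for the ES index.

```python
# Illustrative sketch: job attributes accepted as keyword arguments, and a
# combined getter. Names follow the post; bodies are invented for the example.

KNOWN_ATTRIBUTES = {"JobGroup", "Owner", "Proxy", "SubmissionTime", "RunningTime"}

class ElasticJobDBSketch:
    def __init__(self):
        self.docs = {}  # JobID -> {"parameters": {...}, "attributes": {...}}

    def setJobParameters(self, jobID, parameters=None, **kwargs):
        doc = self.docs.setdefault(jobID, {"parameters": {}, "attributes": {}})
        if parameters:
            doc["parameters"].update(parameters)
        # Any recognised attribute passed as a keyword argument is stored too.
        for name, value in kwargs.items():
            if name in KNOWN_ATTRIBUTES:
                doc["attributes"][name] = value

    def getJobParametersAndAttributes(self, jobID):
        """Return parameters and attributes together for one JobID."""
        return self.docs.get(jobID, {"parameters": {}, "attributes": {}})

db = ElasticJobDBSketch()
db.setJobParameters(42, {"CPUTime": "120"}, Owner="alice", JobGroup="prod")
```

The point of the kwargs interface is that callers can set any subset of attributes in one call without a separate setter per attribute.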

  • Add new table JobsStatus to MySQL backend

    • As previously stated, not all the values linked to a particular JobID are accessed as often as some of the columns of the Jobs table.
    • Hence, separating the status values from the Jobs table is an efficient approach: these are the most frequently queried values, and a dedicated table makes processing more efficient in terms of traversing rows and columns compared to the much heavier Jobs table.
    • The columns of the new tables are as follows:
      • JobID (primary key), Status, MinorStatus, and ApplicationStatus.
    • Along with this, two functions were written/modified to access the new table:
      1. getJobStatus (to retrieve values from the table)
      2. setJobStatus (to set values in the table)
    • These two functions were then used in multiple modules of WMS to write and access the new table.
    • The commits related to this work were merged into the integration branch as part of PR 3744.
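The new table and its two accessors can be sketched like this, using an in-memory SQLite database in place of MySQL so the example runs standalone. The column names come from the post; the function bodies are simplified stand-ins for the real WMS code.

```python
# Illustrative sketch of the JobsStatus table and its accessors, with SQLite
# standing in for MySQL. Columns match the post: JobID (primary key), Status,
# MinorStatus, ApplicationStatus.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE JobsStatus (
        JobID INTEGER PRIMARY KEY,
        Status TEXT,
        MinorStatus TEXT,
        ApplicationStatus TEXT
    )"""
)

def setJobStatus(jobID, status, minorStatus="", applicationStatus=""):
    # One row per job: insert the status triple, replacing any previous row.
    conn.execute(
        "INSERT OR REPLACE INTO JobsStatus VALUES (?, ?, ?, ?)",
        (jobID, status, minorStatus, applicationStatus),
    )

def getJobStatus(jobID):
    row = conn.execute(
        "SELECT Status, MinorStatus, ApplicationStatus FROM JobsStatus WHERE JobID = ?",
        (jobID,),
    ).fetchone()
    return dict(zip(("Status", "MinorStatus", "ApplicationStatus"), row)) if row else None

setJobStatus(7, "Running", "Application", "Executing")
```

Because status lookups touch only this narrow four-column table instead of the full Jobs table, the most common queries scan far less data.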

  • Add Clients for Workload Management System

    • It is desirable that WMS agents don't access the databases (both ElasticSearch and MySQL) directly, but instead go through a running service. These services are reached via RPCClient(), which therefore needs to be instantiated in order to read from or write to the databases.
    • At the same time, the modules themselves should not call RPCClient directly; they should use a Client class that sets up the connection for them.
    • Keeping the above points in mind, I added the following clients to WMS and replaced the direct RPCClient invocations with these classes:
      • JobStateUpdateClient.py
      • JobManagerClient.py
    • The code implementation regarding the above work can be found in Pull Request #3760.
    • The above PR was merged into the integration branch of the DIRACGrid project.
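The wrapper pattern described above can be sketched as follows. Everything here is an assumption for illustration: FakeRPCClient stands in for DIRAC's real RPCClient, and the service name and method are invented.

```python
# Sketch of the Client-wrapper pattern: agents call a thin Client class,
# which hides the RPCClient service lookup from them.

class FakeRPCClient:
    """Stand-in for DIRAC's RPCClient: pretends to talk to a running service."""
    def __init__(self, serviceURL):
        self.serviceURL = serviceURL

    def setJobStatus(self, jobID, status):
        return {"OK": True, "Value": (jobID, status)}

class JobStateUpdateClient:
    """Agents instantiate this class instead of creating RPCClients themselves."""
    def __init__(self, rpcFactory=FakeRPCClient):
        # The client owns the service connection; callers never see RPCClient.
        self._rpc = rpcFactory("WorkloadManagement/JobStateUpdate")

    def setJobStatus(self, jobID, status):
        return self._rpc.setJobStatus(jobID, status)

client = JobStateUpdateClient()
result = client.setJobStatus(42, "Running")
```

Funnelling all service access through one class also makes the agents easy to test, since the factory can be swapped for a mock.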

  • Add tests for WMS Agents

    • It is common knowledge that testing is an important part of the development process. As we keep changing the codebase, we need to ensure that new changes don't break existing functionality or disturb the overall workflow.
    • Since the tests for WMS agents were quite limited, I added tests written with the 'pytest' module for the following agents:
      • JobAgent.py
      • JobCleaningAgent.py
      • PilotStatusAgent.py
      • StalledJobAgent.py
    • Apart from writing these tests, it was important that each function was accessible from the agent's module, so I modified the above-mentioned agent modules accordingly for testing purposes.
    • The commits related to the above work can be found in Pull Request 3771.
    • The above PR is pending review from the code owners and will be merged after further additions and Python 3 support.

  • Modify codes to support Python 3

    • Currently, the DIRACGrid project is written entirely in Python 2, and some of the modules it depends on are only supported in Python 2.
    • At the same time, official support for Python 2 ends in 2020, and most libraries today are adding new features only for Python 3.
    • With this in mind, I proposed, as an optional task, to write my code so that it supports both Python 2 and 3. This was done using the tool python-modernize, which rewrites existing Python 2 code into a form that works under both Python 2 and 3.
    • The commits related to the above work can be found in Pull Request 3765.
    • The above PR is pending review from the code owners, as there is still much for the organization as a whole to work through before a complete port to Python 3 can be done.
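For a flavour of what these rewrites look like, here are hand-written examples of the kinds of changes python-modernize applies; this is not DIRAC code, just the before/after shapes shown as comments, with the "after" form being valid in both Python versions.

```python
# Typical Python 2 -> 2/3-compatible rewrites, of the kind python-modernize
# produces. The __future__ imports make Python 2 behave like Python 3 here.
from __future__ import print_function, division

# Before: print "Job", jobID          After: the print() function
print("Job", 42)

# Before: params.iteritems()          After: params.items() (or six.iteritems)
params = {"Status": "Running"}
pairs = list(params.items())

# Before: 1 / 2 == 0 (int division)   After: true division in both versions
ratio = 1 / 2
```

Each change keeps the code running under Python 2 while removing the constructs that Python 3 dropped, which is what allows a gradual port.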

I hope the milestones above give a fair overall idea of the work I did during the summer. Apart from this, I contributed to documentation and ran performance tests to analyse the differences between the MySQL and ElasticSearch backends. The commits can be found in the above-linked PRs.

A complete list of my Pull Requests can be found at the following link: PR's by radonys.

That brings me to the end of my GSoC journey, but it is not the end of my contributions to open-source projects. I will continue working with the DIRACGrid project in my own capacity and hope to make a significant impact on the open-source community.
