ICALEPCS2023 - Table of Session: TH2A (Software Architecture & Technology Evolution)

Paper	Title	Page
TH2AO01	Log Anomaly Detection on EuXFEL Nodes	1126
	A. Sulc, A. Eichler, T. Wilksen DESY, Hamburg, Germany
	Funding: This work was supported by HamburgX grant LFF-HHX-03 to the Center for Data and Computing in Natural Sciences (CDCS) from the Hamburg Ministry of Science, Research, Equalities and Districts.
	This article introduces a method to detect anomalies in the log data generated by control system nodes at the European XFEL accelerator. The primary aim of the method is to give operators a comprehensive understanding of the availability, status, and problems specific to each node. This information is vital for ensuring smooth operation. The sequential nature of logs and the absence of a rich text corpus specific to our nodes pose a significant limitation for traditional and learning-based approaches to anomaly detection. To overcome this limitation, we propose a method that uses word embeddings and models individual nodes as a sequence of these vectors that commonly co-occur, using a Hidden Markov Model (HMM). We score individual log entries by computing the ratio between the probability of the full log sequence including the new entry and the probability of the previous log entries alone, without the new entry. This ratio indicates how probable the sequence becomes when the new entry is added. The proposed approach detects anomalies by scoring and ranking log entries from EuXFEL nodes: entries that receive high scores are potential anomalies that do not fit the routine of the node. The method provides a warning system that alerts operators to these irregular log events, which may indicate issues.
	Slides TH2AO01 [1.420 MB]
DOI •	reference for this paper ※ doi:10.18429/JACoW-ICALEPCS2023-TH2AO01
About •	Received ※ 30 September 2023 — Accepted ※ 08 December 2023 — Issued ※ 13 December 2023
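The scoring rule described in the TH2AO01 abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes a discrete-symbol HMM in place of word-embedding vectors, made-up toy parameters (`start`, `trans`, `emit`), and a score defined as the negative logarithm of the probability ratio, so that a higher score means a more surprising entry. All function names are hypothetical.

```python
import math


def forward_likelihoods(obs, start, trans, emit):
    """Return P(o_1..o_t) for every prefix t = 1..len(obs) (forward algorithm)."""
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    prefix = [sum(alpha)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
            for j in range(n)
        ]
        prefix.append(sum(alpha))
    return prefix


def anomaly_scores(obs, start, trans, emit):
    """Score entry t as -log(P(o_1..o_t) / P(o_1..o_{t-1}))."""
    prefix = forward_likelihoods(obs, start, trans, emit)
    scores = [-math.log(prefix[0])]  # first entry has no history
    for prev, cur in zip(prefix, prefix[1:]):
        scores.append(-math.log(cur / prev))
    return scores


# Toy model (assumed values): state 0 = routine, state 1 = disturbed.
# Symbols: 0 = "heartbeat ok", 1 = "config reload", 2 = "segfault".
start = [0.9, 0.1]
trans = [[0.95, 0.05], [0.5, 0.5]]
emit = [[0.9, 0.09, 0.01], [0.2, 0.3, 0.5]]

log_sequence = [0, 0, 0, 2, 0]
scores = anomaly_scores(log_sequence, start, trans, emit)
ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
print(ranked[0], [round(s, 2) for s in scores])
```

In this toy run the rare symbol at position 3 ranks first. For long sequences the raw prefix probabilities would underflow, so a practical version would use scaling or log-space arithmetic.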

TH2AO02	High Availability Alarm System Deployed with Kubernetes	1134
	J.J. Bellister, T. Schwander, T. Summers SLAC, Menlo Park, California, USA
	To support multiple scientific facilities at SLAC, a modern alarm system designed for availability, integrability, and extensibility is required. The new alarm system deployed at SLAC fulfills these requirements by blending the Phoebus alarm server with existing open-source technologies for deployment, management, and visualization. To deliver a high-availability deployment, Kubernetes was chosen for orchestration of the system. By deploying all parts of the system as containers with Kubernetes, each component becomes robust to failures, self-healing, and readily recoverable. Well-supported Kubernetes Operators were selected to manage Kafka and Elasticsearch in accordance with current best practices, using high-level declarative deployment files to shift deployment details into the software itself and facilitate nearly seamless future upgrades. A process based on git-sync automatically restarts the alarm server when configuration files change, eliminating the need for sysadmin intervention. To encourage greater engagement from accelerator operators, multiple interfaces are provided for interacting with alarms. Grafana dashboards offer a user-friendly way to build displays with minimal code, while a custom Python client allows for direct consumption from the Kafka message queue and access to any information logged by the system.
	Slides TH2AO02 [0.798 MB]
DOI •	reference for this paper ※ doi:10.18429/JACoW-ICALEPCS2023-TH2AO02
About •	Received ※ 06 October 2023 — Revised ※ 09 October 2023 — Accepted ※ 14 December 2023 — Issued ※ 18 December 2023

TH2AO03	An Update on the CERN Journey from Bare Metal to Orchestrated Containerization for Controls	1138
	T. Oulevey, B. Copy, F. Locci, S.T. Page, C. Roderick, M. Vanden Eynden, J.-B. de Martel CERN, Meyrin, Switzerland
	At CERN, work has been under way since 2019 to transition from running Accelerator controls software on bare metal to running it in an orchestrated, containerized environment. This will allow engineers to optimize infrastructure cost, improve disaster recovery and business continuity, and streamline DevOps practices while improving security. Container adoption requires developers to apply portable practices, including aspects related to persistence integration, network exposure, and secrets management. It also promotes process isolation and supports enhanced observability. Building on containerization, orchestration platforms (such as Kubernetes) can be used to drive the life cycle of independent services in a larger-scale infrastructure. This paper describes the strategies employed at CERN to make a smooth transition towards an orchestrated containerized environment and discusses the challenges, based on the experience gained during an extended proof-of-concept phase.
	Slides TH2AO03 [0.480 MB]
DOI •	reference for this paper ※ doi:10.18429/JACoW-ICALEPCS2023-TH2AO03
About •	Received ※ 06 October 2023 — Revised ※ 24 October 2023 — Accepted ※ 14 December 2023 — Issued ※ 19 December 2023

TH2AO04	Developing Modern High-Level Controls APIs	1145
	B. Urbaniec, L. Burdzanowski, S.G. Gennaro CERN, Meyrin, Switzerland
	The CERN Accelerator Controls comprise various high-level services that work together to provide a highly available, robust, and versatile means of controlling the Accelerator Complex. Each service includes an API (Application Programming Interface) which is used both for service-to-service interactions and by end-user applications. These APIs need to support interactions from heterogeneous clients using a variety of programming languages, including Java, Python, and C++, or direct HTTP/REST calls. This presents several technical challenges, including reliability, availability, and scalability. API usability is another important factor, with emphasis on ease of access and on minimizing exposure to Controls domain complexity. At the same time, the APIs must be able to evolve over time efficiently and safely. This paper describes concrete technical and design solutions addressing these challenges, based on experience gathered over numerous years. To further support this, the paper presents examples of real-life telemetry data focused on latency and throughput, along with the corresponding analysis. The paper also describes ongoing and future API development.
	Slides TH2AO04 [2.676 MB]
DOI •	reference for this paper ※ doi:10.18429/JACoW-ICALEPCS2023-TH2AO04
About •	Received ※ 03 October 2023 — Revised ※ 12 October 2023 — Accepted ※ 17 December 2023 — Issued ※ 18 December 2023

TH2AO05	Secure Role-Based Access Control for RHIC Complex	1150
	A. Sukhanov, J. Morris BNL, Upton, New York, USA
	Funding: Work supported by Brookhaven Science Associates, LLC under Contract No. DE-SC0012704 with the U.S. Department of Energy.
	This paper describes the requirements, design, and implementation of Role-Based Access Control (RBAC) for the RHIC Complex. The system is being designed to protect against accidental, unauthorized access to equipment of the RHIC Complex, but it can also provide significant protection against malicious attacks. The role assignment is dynamic. Roles are primarily based on user ID, but elevated roles may be assigned for limited periods of time. Protection at the device manager level may be provided for an entire server or for individual device parameters. A prototype version of the system has been deployed at the RHIC Complex since 2022. Authentication is performed on a dedicated device manager, which generates an encrypted token based on user ID, expiration time, and role level. Device managers are equipped with an authorization mechanism that supports three methods of authorization: Static, Local, and Centralized. Transactions with the token manager take place 'atomically', during secured set() or get() requests. The system has a small overhead: ~0.5 ms for token processing and ~1.5 ms for the network round trip. Only Python-based device managers participate in the prototype system. Testing has begun with C++ device managers, including those that run on VxWorks platforms. For an easy transition, dedicated intermediate shield managers can be deployed to protect access to device managers that do not directly support authorization.
DOI •	reference for this paper ※ doi:10.18429/JACoW-ICALEPCS2023-TH2AO05
About •	Received ※ 04 October 2023 — Revised ※ 14 November 2023 — Accepted ※ 19 December 2023 — Issued ※ 22 December 2023
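The token flow described in the TH2AO05 abstract (a token built from user ID, expiration time, and role level, checked by a device manager before a set() or get()) can be illustrated with a small sketch. This is not the RHIC implementation. It swaps in an HMAC-signed token rather than an encrypted one, uses a hard-coded demonstration key, and assumes that a larger integer role level means more privilege. The names `issue_token` and `authorize` are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-only-secret"  # hypothetical shared key, for illustration only


def issue_token(user_id, role_level, ttl_s, now=None):
    """Build a signed token carrying user ID, role level, and expiry time."""
    now = time.time() if now is None else now
    claims = {"uid": user_id, "role": role_level, "exp": now + ttl_s}
    payload = json.dumps(claims, sort_keys=True).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(sig).decode()
    )


def authorize(token, required_role, now=None):
    """Return True if the token is authentic, unexpired, and has enough role."""
    now = time.time() if now is None else now
    try:
        p64, s64 = token.split(".")
        payload = base64.urlsafe_b64decode(p64)
        sig = base64.urlsafe_b64decode(s64)
    except ValueError:  # malformed token (binascii.Error is a ValueError)
        return False
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return False
    claims = json.loads(payload)  # parsed only after the signature check
    return claims["exp"] > now and claims["role"] >= required_role


token = issue_token("operator1", role_level=2, ttl_s=60, now=1000.0)
print(authorize(token, required_role=2, now=1010.0))  # valid and sufficient
print(authorize(token, required_role=3, now=1010.0))  # role too low
print(authorize(token, required_role=1, now=2000.0))  # expired
```

Signing keeps the sketch within the standard library. A token that must hide its contents, as the abstract's encrypted token does, would need an authenticated encryption scheme instead.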

TH2AO06	SKA Tango Operator	1155
	M. Di Carlo, M. Dolci INAF - OAAB, Teramo, Italy P. Harding, U.Y. Yilmaz SKAO, Macclesfield, United Kingdom J.B. Morgado Universidade do Porto, Faculdade de Ciências, Porto, Portugal P. Osorio Atlar Innovation, Pampilhosa da Serra, Portugal
	Funding: INAF
	The Square Kilometre Array (SKA) is an international effort to build two radio interferometers in South Africa and Australia, forming one Observatory monitored and controlled from global headquarters (GHQ) based in the United Kingdom at Jodrell Bank. The software for the monitoring and control system is developed on the TANGO-controls framework, which provides a distributed architecture for driving software and hardware using CORBA distributed objects that represent devices and communicate with ZeroMQ events internally. This system runs in a containerised environment managed by Kubernetes (k8s). k8s provides primitive resource types for the abstract management of compute, network, and storage, as well as a comprehensive set of APIs for customising all aspects of cluster behaviour. These capabilities are encapsulated in a framework (Operator SDK) which enables the creation of higher-order resource types assembled out of the k8s primitives (Pods, Services, PersistentVolumes), so that abstract resources can be managed as first-class citizens within k8s. These methods of resource assembly and management have proven useful for reconciling some of the differences between the TANGO world and that of Cloud Native computing, where the use of Custom Resource Definitions (CRD) (i.e., Device Server and DatabaseDS) and a supporting Operator developed in the k8s framework has given rise to better usage of TANGO-controls in k8s.
	Slides TH2AO06 [2.622 MB]
DOI •	reference for this paper ※ doi:10.18429/JACoW-ICALEPCS2023-TH2AO06
About •	Received ※ 27 September 2023 — Revised ※ 24 October 2023 — Accepted ※ 14 December 2023 — Issued ※ 21 December 2023
