Status

Current Status: Normal

Cloud login reactivation (Status update 06.04.2023)

Dear cloud users,

At last, we have some good news to share regarding the LRZ Compute Cloud:
We will reactivate the login to the Web management interface today.

As I wrote on the status page on Monday, we have done a lot of bug fixing over the past two and a half weeks.
We tracked down several timeouts in the virtual networking layer and patched some software components of OpenStack. Unfortunately, this did not resolve the issue of VMs losing their network connectivity.
As you might have noticed, (most of) your VMs were reactivated a week ago but may have lost their connectivity once in a while.
We used this week to observe the Cloud system with all user VMs running and to try to stabilize the networking. We kept your VMs running to make sure we were testing the real system, with more than 100 custom virtual networks owned by different users and more than 800 VMs.

There are some changes to your resources you might notice:

  1. We have added a rule to all security groups allowing us to ping every VM in the cloud.
    This rule allows us to keep track of any VMs losing their network.
    If you inspect your security groups you will find an additional entry to them:
    Rule: direction: ingress, ether_type: IPv4, protocol: icmp, port_range: [None,None], remote_ip_prefix: 10.156.242.250/29, remote_security_group: None
    "direction: ingress" means that this rule affects incoming network traffic
    "ether_type: IPv4" states that this rule affects IPv4 traffic (which is the type we use in the cloud)
    "protocol: ICMP" limits the traffic type to ICMP (which is used for "ping")
    "port:range: [None,None]" does not limit the traffic to any particular port or port range
    "remote_ip_prefix: 10.156.242.250/29" limits the traffic to a particular set of 6 IP addresses. This IP space is used by hardware machines in our Cloud network that we use to carry out monitoring tasks. These machines are not accessible from the outside and are only used by us administrators.
    "remote_security_group: None" does not limit traffic to any particular security group the sender needs to be in.

    Long story short: this rule allows us to ping all VMs from dedicated hardware machines sitting in our cloud racks.
    This rule does not allow your VMs to be pinged from any address outside the compute cloud management network. If you want to ping your VM yourself, you still need to add an appropriate rule allowing incoming ICMP traffic from everywhere or from your custom IP (range); see the sketch after this list.

    Please note that we will continue to require this rule in the future. Without this rule we simply have no way of keeping track of your VMs' network connectivity.
    Please do not remove this rule from any of your security groups.
  2. We set the default DHCP lease time to infinity.
    Setting the DHCP lease time to infinity prevents your VM from dropping its DHCP lease, so it keeps the assigned IP address forever.
    In our system, OpenStack assigns a private address (192.168...) to your VM when the network interface is created. This IP address stays assigned for the entire lifetime of that interface. There is no need to frequently contact our DHCP servers to renew the lease, because it will never change.
    By increasing the default lease time, the VMs keep their network connection for as long as they live. We have closely monitored the situation over the past two days and have not seen any VMs losing their connection.
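
If you want to allow ping from your own address as well, the rule can also be added programmatically instead of through the web interface. Below is a minimal sketch using the Python openstacksdk client; the cloud name "mycloud", the security group "default" and the address 203.0.113.17 are placeholders that you would replace with your own values.

    # Minimal sketch: add an ICMP (ping) ingress rule to one of your own
    # security groups via openstacksdk. All names below are placeholders.
    import openstack

    # Requires a clouds.yaml entry (or OS_* environment variables) for your project.
    conn = openstack.connect(cloud="mycloud")

    # Look up the security group you want to open for ping.
    sg = conn.network.find_security_group("default")

    # Allow incoming ICMP from a single address; widen the prefix
    # (e.g. "0.0.0.0/0") only if you want to be pingable from anywhere.
    conn.network.create_security_group_rule(
        security_group_id=sg.id,
        direction="ingress",
        ether_type="IPv4",
        protocol="icmp",
        remote_ip_prefix="203.0.113.17/32",  # replace with your own IP (range)
    )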

In the course of the debugging and error-fixing work, some of the VMs were deleted. If one of your VMs is missing, you can simply create a new one from its still-existing volume, which you can find in the section "Volumes → Volumes" of the cloud's web interface.
Each volume that is not assigned to an existing VM is in the state "Available".
To create a new VM based on this volume click on the arrow pointing downwards next to the "Edit Volume" button and select "Launch as Instance".
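
If you have many volumes and want to find the detached ones without clicking through the web interface, here is a small sketch using the Python openstacksdk client; the cloud name "mycloud" is a placeholder for your own clouds.yaml entry. The actual "Launch as Instance" step is still done in the web interface as described above.

    # Minimal sketch: list volumes that are not attached to any VM
    # (status "available"), i.e. candidates for "Launch as Instance".
    import openstack

    conn = openstack.connect(cloud="mycloud")  # name of your clouds.yaml entry

    for volume in conn.block_storage.volumes():
        if volume.status == "available":
            print(volume.id, volume.name, volume.size, "GiB")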

I would like to thank you for your patience in the past two and a half weeks.
It has been a rough time and having an understanding community of users was quite helpful.
We have not been very responsive during this time and will now start working on the things we could not attend to in the past days.

If you encounter any problems with your VMs, let us know.
Thank you and have a nice weekend.

Update 03.04.2023:

Hello,

It is hard to write update messages when the core message remains the same:
We are still tracking down software errors, misconfigurations, strange behaviour and packet loss.

As I wrote earlier, we have fixed a lot of things so far, but unfortunately this has not led
to a stable system. We got rid of many error messages in the logs, but yesterday network
connectivity to the running VMs broke down again.

I will let you know if I have any news to share.
Best regards.

Update 30.03.2023:

Good evening,

I want to provide a brief update on our current situation.
In the past days we have debugged and fixed many software and configuration
issues in our Cloud infrastructure.

Until today we have only used our own virtual networks (like MWN and Internet).
However, there are more than 170 other virtual networks belonging to user projects,
which increase the complexity of the virtual networking layer quite a bit.
At the moment we are activating these networks and keeping an eye open
for any log entries pointing to remaining problems.

As I wrote earlier, we are focusing on bringing the Attended Cloud Housing resources back up
before we start making the cloud available to all other users again.
Starting VMs on these particular hardware nodes will also add more complexity
to the system, which we need to monitor closely.
Although we are making progress, I cannot give you a date when the cloud will be
available again.

Thanks for reading and have a nice evening!

Update 27.03.2023:

Dear Cloud users,

Unfortunately, we still have no good news to share.
At the moment we are working on bringing back the Attended Compute Cloud Housing Resources as quickly as possible.

We are discussing the next steps on how to cope with the situation regarding the "publicly" available Cloud nodes.

I apologize for any inconvenience. Have a nice evening.

Original Message (21.03.2023):

Dear cloud-users,

As some of you might have noticed, we are experiencing networking problems with the LRZ Compute Cloud.

These problems have existed since Monday.
Please do not write any emails regarding connection issues or open support tickets.
We are aware of the problems and are currently working on tracking down these issues.
We cannot respond to all of these tickets now (and would need to close them afterwards with the exact same answer).

I will inform you if we have updates on that.
Sorry for any inconvenience.
Best regards, have a nice evening