name is Rohiit Talreja. I’m a product manager on
the Google Cloud Health Care and Life Sciences team. I focus specifically
on data governance, so I hope it makes sense
why I’m giving the Security and Compliance
session here today. Thanks for joining. I know it’s a little bit
late in the afternoon. Thanks for staying awake. I hope to keep you that way. So today, I’m here to talk
about what we call the shared responsibility model. And what that means is
when we have customers who are using a
Cloud service, it’s a little bit different than
older models of infrastructure, such as on-premise. So when we think about
the shared security model, there exist some
responsibilities that fall to the Cloud provider
and some responsibilities that fall to the user,
or the customer. And this is the frame of
reference for the talk today. We’ll be going over how Google
thinks about health care compliance with a focus
probably on HIPAA, and then talking about
what you as our users can do with that context,
and what else you may have to think
about as you build, design, and secure your
workloads on Google Cloud for health care data. So just setting a
little bit of context, health care data is under
duress as I call it. Health care organizations
are experiencing, on average, more than twice the
number of attacks compared to organizations in
other vertical categories. When we talk about attacks,
what are we talking about? But before we get to that, let’s
talk about why this matters. The cost of a data
breach is going up year over year over year. So a study that was conducted
between February of 2017 and April of 2018 calculated
that the average cost of a data breach was just
under $4 million. So what does this cover? This covers the
loss of customers due to reputation loss. This covers the data that was
affected, potentially fines coming over that, the
cost of forensics, the cost of communication
to regulators, to customers, to affected parties, and of
course, long-term damages, with long-term
damage to reputation being the primary
driver of this cost. So when we talk about
causes of data breach, it’s also good to know
who is experiencing them. So here is a startling number:
over 90% of health care organizations have
experienced a breach within the last three years. And 50% of health
care organizations have suffered five
or more breaches in that same time frame. So when I say health care
data is under duress, I hope this now comes across. And again, it’s
important to point out that when we say a breach,
it’s not necessarily hacking. It’s not necessarily malicious. It could be a doctor
sending a fax– yes, a fax– to
the wrong office. It could be a patient sending
data from their doctor to a third party. It could be a doctor
not disposing of records appropriately when that
patient leaves the system. So it’s important to say that
a breach is a loaded term. It covers both malicious activity
and anything else that’s defined as a breach in the regulation. So let’s talk more about
costs, specifically distributed denial of service attacks,
which are becoming more and more popular over time. We all remember the distributed
denial of service attacks that took place in the
last couple of years. They crippled a number
of institutions. They cost about $2
million a piece. And every 40 seconds, a
health care organization is hit with a ransomware attack. And the same assessment
said that 1 in 6 health care organizations are affected. Now that I’ve scared
you a little bit, let’s scare you a
little bit more. So when we think about the
number of attacks going up, one would assume that
the number of protections is also rising to meet
that increase in attacks. Actually, it’s
unfortunately the opposite. Cybersecurity budgets in
health care organizations have dropped to just
about 3% of total spend. This is not to say that
health care organizations’ budgets are decreasing overall. Actually, the amount that health
care organizations are spending is increasing year over year. However, the amount spent on
cybersecurity is constant, meaning that in
proportion, less money is being spent on
increasingly important areas. So why is health
care a prime target? I think most people
would know this, but we’ll talk a little
bit into specifics. Health care data is rich. And when I say the
data is rich, it means that I can get data
about multiple aspects of an individual just
by breaching one system. So if we think about
your email account, maybe that has some personal
identifying information. Your bank account maybe has
some personal identifying information. But really, when you
think about those security questionnaires you fill in– what was your
mother’s maiden name? What street did you grow up on? Where were your parents born? Where were you born? There’s really only one place
where all of that information is– health care records. And that means that
the total, let’s say, cost-benefit analysis
for a hacker asking, where do I spend
my hacking money and time, is that health care data is
less protected and more useful. Therefore, let’s look at that. The other reason why health
care data is a prime target is that it is stored
primarily on legacy systems. So we know cybersecurity
budgets are going down. We talked about that. Systems are older, easier
to breach, theoretically. And lastly, there is
a specific category of hackers who
hack for fun, hack to cause the most
damage possible. And when we look at health
care organizations taken down by things like
WannaCry, taken down by things like database
attacks, downtime impacts patient safety. And therefore, people
who get pleasure out of taking down
these organizations are looking to health care
because unfortunately, it causes a lot of damage. So where can we go from here? And as I said, this
talk will be focused on both what Google is doing
in the health care compliance and security space and what
you, as users of Google Cloud, can do to help prevent all
of these types of attacks. So level setting, I know
this audience probably knows more than the average
audience about what HIPAA is and what it entails. But HIPAA is the US regulation for the
protection of health care data, known as Protected
Health Information. So if you hear me
use the acronym PHI throughout this
talk, I’m referring to Health Care Information
as defined under HIPAA. But we can probably
generalize that to mean any sensitive health care
information globally. The same sort of
protections would apply. So HIPAA is broken into three
sections of requirements. We have administrative
safeguards, which is basically how
you run your business, how you hire people,
how you give them access to systems, what
they do with those systems, whether you plan
for disaster events. And then we have
physical safeguards. It’s like, when do you let
people into your buildings? Where do you store
your IT systems? Who gets access to
your IT systems? Before you give a vendor
access to your server closet, what sort of processes
are in place to make sure that they’re properly vetted
and they’re only doing what they should be doing? And lastly, the
biggest bucket, which is where I’ll spend
most of the time today, is technical safeguards. So HIPAA, being a
slightly older law, was not necessarily created
for the cloud environment. However, it has been
amended a couple of times to account for higher
tech activities. And it has been interpreted
by the government and by industry
groups to make sure that the technical
safeguards under HIPAA have somewhat caught up with
growing technology trends. So in terms of what we’ll
be talking about today, we’ll talk about identity and
access control, encryption and transmission protection,
audit logging, activity logging, and audit controls. So I’ll touch upon these both
first from the Google side and then from the customer side. So it’s important to
talk about the HIPAA BAA when talking about HIPAA. The Business Associate
Agreement, or BAA, is the contract that
formalizes the requirements between the service provider,
the business associate, and the covered entity– generally, the insurance
plan, the provider system, or the health care
information clearing house. And this BAA basically
formalizes the relationship and says both parties agree that
HIPAA data is being exchanged in this contract. And here are the
security requirements that are in place that
govern the use, protection, transmission of that
health care data. It’s important to note
that Google Cloud is one of the few
providers that offers a rigorous Enterprise-grade
BAA that covers a large number of GCP services. And even more than
that, we do it at no additional
cost to the customer because we believe that security
in health care is not optional. We shouldn’t give
you fewer protections just because you’re in health
care and have a tight budget. So all of our protections in
the BAA are there by default. There is no upcharge. And specifically, it includes
any region, any instance size that’s covered for the
service in the BAA, and has all the protections
around breach notification and encryption by default. So going into detail,
administrative safeguards. What are they? This is Google’s
security program. We design our security
system with an approach we call defense
in depth, whereas the traditional security model
is to have a hard perimeter. Make it very hard for people
to get on to your network using things like tough
firewalls, VPNs, bastion hosts, dedicated machines that
can access that network. But think of it
like an eggshell. You have a hard perimeter,
but then once you’re inside, a gooey center. So anybody who gets
onto that network can then have free
access to anything that is on the network, your EHR
systems, your billing systems. You name it. Anything that’s on
the network, they have if they can
compromise your network. Instead, Google enforces
defense in depth by putting different
sorts of protections that are relevant to different
attacks at different levels of the infrastructure. So for example, when
we talk about hardware, Google designs and plans our
own server hardware, storage hardware, compute hardware. And by doing so, not only do we
control the exact performance of the hardware to meet the
specifications that we need, we can also cut out
unnecessary components and control our
supply chain down to off-the-shelf
components, reducing “vendor in the middle” risk. Taking it all the
way to the top layer, we have a robust
identity layer in place that only allows approved
individuals to access services, and only services that
have been approved to communicate with each
other to perform activities with each other. So I won’t talk through all
of this slide in detail, but this is just trying to say
there is a defense in depth approach that is fundamental to
everything we do in security, down from our culture to
our technical controls and our operations. I think this slide,
hopefully, should give you an idea that we understand
security certifications are table stakes. As part of the
administrative actions of running a secure
compliance program, Google has gone out and
gotten certifications that are relevant to
us as a cloud provider, such as ISO, which covers
fundamental cloud security and privacy certifications,
as well as certifications that are specific to our
customers, which helps them grow on Google Cloud. So this is things like HITRUST. This is things like FedRAMP. Basically, certifications
where any vendor in the chain needs to be certified
so that the one that’s providing the final service
to the end customer can get certified as well. Continuing onto
administrative activities, Google regularly conducts
disaster recovery drills by simulating real
world scenarios. So these could be a fake
earthquake that potentially knocks out power to a data
center, which of course we simulate, hopefully. Dropping the power to the
internet, things like that. There was a day in the office
where I showed up to work. The video conferencing
cut mid-meeting. Somebody sent a text message
that said, oh, by the way, the internet is down for
the next three hours. Enjoy. Tell us what you did
so we can write it down for the actual
plan going forward. And I say this as
a cavalier story, but really, these
tests are designed to give teams the
ability to respond to real-world scenarios. And how they respond to the
test tests their plan of action and also helps them improve
their plan of action for when real incidents occur. So talking about, again,
administrative actions. What is Google doing to
run a proper HIPAA program, proper health care
compliance program? This is the list of services
covered by the BAA as of some time last week. I would probably
be willing to say that it’s already out of date,
in that we continue to add more services to the BAA over time. They’re being
added very rapidly. There’s probably a
couple more that have been added in the last week. So when we talk about physical
safeguards, facilities management. So physical safeguards, I
won’t spend too much time talking about because it
really just depends on, how do we run our data centers? How do we run our offices? When we talk about
our data centers, our data centers are some of
the most protected buildings in the world. They’re protected by
custom-designed electronic access cards, alarms, perimeter
defenses, laser-beam detection. Imagine a heist movie. Anything you see in that heist
movie, we probably have it. And there’s probably
some other stuff in there that they won’t
show in the movies. It’s safe to say facility
access is tightly controlled. Only approved employees
and their guests, and vendors,
potentially– people who should be there are there. They’re vetted before
they get there. Background checks
are done, and so on. Technical safeguards. Defense in depth. I talked about this. This just shows our
fundamental approach to technical safeguards, and
then going into some detail. We have encryption
by default at rest. And when I say by
default, what do I mean? What I mean is
customers don’t have to check a box that says
turn on default encryption. They don’t have to
say what size keys they want, what
level of encryption, what data stores are in
scope for encryption or not. No. Data at rest on Google
Cloud Platform is encrypted. This is a sample of one
layer of encryption. So different services,
different storage applications have maybe multiple
levels of encryption. This just shows an individual
file coming into Google Cloud. And what happens
here is that file is broken into multiple pieces. Each of these pieces is
encrypted with its own key. Then we wrap that key
with a key encryption key so that we don’t store
keys in plain text. The keys are also encrypted. And then each of these pieces
and its encryption key is put into a different
physical system in many cases. So this is basically saying
data is not only encrypted, it’s split up, encrypted,
and then distributed across the infrastructure
so that a failure in any one machine or multiple
machines does not compromise the integrity of the file. It also means that if, for
some hypothetical case, somebody got physical access
to one machine, the probability that, A, there is an entire file
on that machine is low, and B, all of the chunks are
encrypted with different keys. So the attack surface,
again with defense in depth, is lowered with every step. So when we talk about endpoint
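A minimal Python sketch of the split, encrypt, and wrap flow just described. It is an illustration under assumptions, not Google's implementation: the function names, the 8-byte chunk size, and the record layout are made up for the example, and the XOR keystream is a toy stand-in for a real cipher such as AES, used only so the sketch runs with the standard library.

```python
import hashlib
import secrets

CHUNK_SIZE = 8  # hypothetical; tiny on purpose so a short input yields several chunks


def _toy_cipher(key: bytes, data: bytes) -> bytes:
    """TOY cipher, NOT secure: XOR data with a SHA-256-derived keystream.

    It only stands in for a real cipher such as AES. Applying it twice
    with the same key returns the original bytes.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def encrypt_file(plaintext: bytes, kek: bytes) -> list[dict]:
    """Split into chunks, encrypt each with its own data key, wrap each key."""
    records = []
    for start in range(0, len(plaintext), CHUNK_SIZE):
        chunk = plaintext[start:start + CHUNK_SIZE]
        dek = secrets.token_bytes(32)  # fresh data encryption key per chunk
        records.append({
            "ciphertext": _toy_cipher(dek, chunk),
            # The data key is stored only in wrapped (encrypted) form.
            "wrapped_dek": _toy_cipher(kek, dek),
        })
    return records


def decrypt_file(records: list[dict], kek: bytes) -> bytes:
    """Unwrap each chunk's data key with the key encryption key, then decrypt."""
    out = bytearray()
    for record in records:
        dek = _toy_cipher(kek, record["wrapped_dek"])
        out += _toy_cipher(dek, record["ciphertext"])
    return bytes(out)
```

In this sketch each record could be placed on a different machine: a single stolen record holds only one encrypted chunk and a wrapped key that is useless without the key encryption key.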
security, Google, on our end, we use devices that
are updated centrally. They auto update. They make sure that
there are strong security primitives on the device. And we also make these
devices and software available to our
customers, which we’ll talk a little bit about later. And when we talk about defense
in depth, one of the easiest things an organization can do
to secure their technology, to secure their systems, is
have two-factor or multi-factor authentication in place. This helps prevent
phishing. This helps add an additional
layer of auditing and logging, and basically makes
it so that the attack surface is, again, reduced. So now that we’ve talked a
little bit of an overview into how Google secures
our production systems and makes services
available to customers, I want to talk a little
bit about how customers should think about
their controls that are available to them. Not everything is going to be
handled by the Cloud provider, because Google wants to give
customers the flexibility to build the systems
that they need to build. We don’t know what data you
want or what data schema you want. We don’t know where
you want it stored. We don’t know what
network setup you want. But what we do want
to do is give you the controls to do what
it is you need to do. So I have a similar diagram. Our philosophy of
defense in depth transcends our own
sphere of influence. And we would like to give
customers that same ability. So customers are working on
potentially the same level of controls that we are. Customers care about
infrastructure security in some cases, but in our case,
we’ve taken care of that one. So customers care
about network security. They care about data security,
application security, identity and access management,
endpoints, monitoring and operations. And then, of course, wrapping
it nice and tight together in that governance, risk,
and compliance framework. So the way I think about this
left column of governance, risk, and compliance
is basically good documentation and
policies around everything else on the slide. So if you don’t have good
data security, for example, your compliance is going
to be hard for you. So this is the
framework of protections that are in place, diving
into a little bit more detail. But first, touching
again on shared security, because I really think that’s
a fundamental consideration for this. When we think about our
infrastructure services, Google has taken care
of the hardware, how the underlying hardware
boots, what kernel it has, how the data is
encrypted and stored, how the audit logging is done. But then when we
get to that guest OS, that is the layer at
which the controls become available to the customer. Do they want Linux? Do they want Windows? Do they want something
else, something proprietary? And that’s the level
of control at which customers start on our
infrastructure services. If you don’t want
that level of control, if you don’t need that level of
flexibility for every system, why not take advantage
of one of the platform as a service solutions? And that’s because Google has,
again, made more sane choices about the security of those
systems all the way up to potentially the
application itself. In which case, you can drop
in your business logic, and rest assured that
the underlying features have been accounted for. And you have less and less
responsibility over time. And this translates also into
our software as a service solution, so G Suite, which
encompasses Gmail, Drive. Basically, the only thing
you’re doing in that case is managing the data, managing
who can access the data, and managing who you
share the data with. So going into those
same categories, where do customer
responsibilities lie in identity and access? First, we should talk about
what that means on Google Cloud. I think we can get a
little bit into some of the technical details here. We can say that on
Google Cloud, there are two main types of identities. We can have a human
identity, which is me doing work as myself. And we can also have
service accounts, or what we call robot identities. That’s like some service doing
work on behalf of a system, on behalf of a group of people,
serving an API, running cron jobs on data. Basically, things that a
human wanted to automate, so they let a machine do it
periodically, or a machine do it for the scope of security. The way these accounts
authenticate is different. So when you have a human
account, you log onto GCP. You’ll type in your
username and password. You’ll enter, hopefully, as
you should, your second factor or third factor
of authentication. And then you will
function as yourself. When a service account
wants to authenticate, it’s provided a key. And that key basically
says, this service account is accredited for this
organization to perform this
scope of operations. And it’s important to
note that at some point, things become similar. Both humans and service
accounts need IAM roles, IAM permissions on resources
in order to modify them, in order to use them,
things like that. So a key concept here– different types of identity,
different types of accounts are relevant to
different applications. If you’re doing an
admin operation, you might want to use a
human account because it’s an ad hoc, one-off thing. If you’re automating something,
if you’re providing a service, scoping down the exact service
operations to a robot account helps maintain security. And it helps maintain security
because that robot is scoped to only a limited
set of duties and it can’t override that scope. Taking this one step further. Now, we’ve talked about
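To make the point about scoping concrete, here is a small Python sketch of how an IAM-style policy might be evaluated for a human account and a service account. The `bindings` layout and the `user:` and `serviceAccount:` member prefixes follow the general shape of IAM policies, but the email addresses are hypothetical, the permission sets are a simplified subset chosen for illustration, and `is_allowed` is not a real Google Cloud function.

```python
# Simplified, illustrative subset of what each role grants (assumption).
ROLE_PERMISSIONS = {
    "roles/storage.objectViewer": {
        "storage.objects.get",
        "storage.objects.list",
    },
    "roles/storage.objectAdmin": {
        "storage.objects.get",
        "storage.objects.list",
        "storage.objects.create",
        "storage.objects.delete",
    },
}

# Hypothetical principals: one human admin, one narrowly scoped robot account.
ADMIN = "user:admin@example.com"
REPORT_JOB = "serviceAccount:report-job@example-project.iam.gserviceaccount.com"

POLICY = {
    "bindings": [
        {"role": "roles/storage.objectAdmin", "members": [ADMIN]},
        {"role": "roles/storage.objectViewer", "members": [REPORT_JOB]},
    ]
}


def is_allowed(policy: dict, member: str, permission: str) -> bool:
    """Return True if any binding grants `permission` to `member`."""
    for binding in policy["bindings"]:
        granted = ROLE_PERMISSIONS.get(binding["role"], set())
        if member in binding["members"] and permission in granted:
            return True
    return False
```

Here the report job can read objects but cannot delete them, no matter what code runs under its key, which is the sense in which a robot account cannot override its scope.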
manipulating resources on Google Cloud. What about securing
applications? So Google takes the identity
model one step further and has made something
available to customers called Identity-Aware Proxy. And what this is is a Load
Balancer and proxy that sits in front of applications. And when a user
request comes in, the IAP basically does a
check for who that user is, what they’re entitled to
do on the application, and those things have to match. So not only does it do
identity-based controls, it also does context controls. Is this user
accessing the service from an approved IP address? Are they accessing the service
from an approved partner service? So a lot of these checks can
be done at this proxy layer and make sure that only
legitimate requests, legitimate access can make it
into the back end application. So now talking
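The two kinds of checks mentioned here, identity and context, can be sketched in a few lines of Python. This is a toy model and not the Identity-Aware Proxy API: the `Request` type, the user list, the paths, and the approved network (a documentation-only IP range) are all assumptions for illustration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    user: str       # identity established by the proxy after sign-in
    source_ip: str  # where the request is coming from
    path: str       # which application resource is requested


# Hypothetical per-path entitlements.
ALLOWED_USERS = {
    "/records": {"alice@example.com", "bob@example.com"},
    "/billing": {"alice@example.com"},
}

# Hypothetical approved networks (203.0.113.0/24 is reserved for documentation).
APPROVED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def admit(request: Request) -> bool:
    """Forward the request only if both identity and context checks pass."""
    identity_ok = request.user in ALLOWED_USERS.get(request.path, set())
    source = ipaddress.ip_address(request.source_ip)
    context_ok = any(source in network for network in APPROVED_NETWORKS)
    return identity_ok and context_ok
```

A request from an entitled user on an unapproved network is rejected just like a request from an unknown user, so the back end only ever sees traffic that passed both checks.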
about user controls for encryption, data management,
and transmission protection. First, remember
when I showed you the diagram about how Google
encrypts data by default? Well, it turns out that’s only
one of the encryption options available on Google Cloud. It’s the leftmost one here. What we call,
default encryption. So it’s the same diagram we
have on the previous slide, just showcasing the other two
options for key management. So maybe you want a
little bit more control than the default encryption. You want to specify
which keys are used to encrypt which files, how
often those keys are rotated, when those keys are deleted,
and whether any of the keys need to be reused over time. Basically, the
default encryption has made all of those
decisions for you and offered it in a
managed service approach. But if any one of
those decisions sounds like a decision
you need to make, that’s what customer-managed
encryption keys are for. And what you do is you’d
configure the service. Say, I would like to
manage the encryption keys. When that service needs
to encrypt or decrypt a file, it will go find
the key you specify, use it on that file,
and then put that key back. There is an even more
controlled option called Customer Supplied Keys. And what this enables
you to do is not store any of the encryption
keys on Google Cloud. So both the first option,
which is default encryption, and the second option, customer
managed encryption keys, store the encryption keys
in what we call Google Cloud KMS, our Key Management System. If you want to store the
encryption keys on-prem, then you can use
customer-supplied encryption keys. And the way you would do
that is, in the API request to access a file, you
specify the encryption key. And that file is
decrypted and sent to you, and the key is not stored. And the same thing works
when you’re storing a file. So you give it the file. You give it the
encryption key you want. We perform our chunking
and encryption the same way we normally would,
but we use your key, and we destroy it
afterwards so that you have to keep supplying it. The one thing I will
caveat with this is that this can be
difficult to get right. If you don’t
have a robust existing key management service on-prem
or on another cloud system– wherever you’re
storing your keys– you can end up encrypting data that nobody can read. We don’t store the key. And if you don’t store that key either,
that data is effectively gone. So it’s important to say that
while this offers more control, it also offers more
responsibility. So another thing
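As a sketch of what "specifying the key in the API request" can look like, the Python below builds the three request headers that Cloud Storage documents for customer-supplied encryption keys, as far as I understand that documentation. The helper name `csek_headers` is made up for this example, and nothing here sends a request; treat the header details as something to verify against the current documentation.

```python
import base64
import hashlib
import secrets


def csek_headers(key: bytes) -> dict[str, str]:
    """Build customer-supplied encryption key headers for one request.

    The key itself travels with every request and is expected to be a
    32-byte AES-256 key, base64 encoded, alongside a base64 SHA-256 hash
    of the raw key.
    """
    if len(key) != 32:
        raise ValueError("a customer-supplied key must be exactly 32 bytes")
    return {
        "x-goog-encryption-algorithm": "AES256",
        "x-goog-encryption-key": base64.b64encode(key).decode("ascii"),
        "x-goog-encryption-key-sha256": base64.b64encode(
            hashlib.sha256(key).digest()
        ).decode("ascii"),
    }


# The caller generates the key and must keep it somewhere safe:
# if it is lost, the data encrypted with it cannot be recovered.
my_key = secrets.token_bytes(32)
headers = csek_headers(my_key)
```

Because the same headers have to accompany every read and write, the key management burden described above sits entirely with whoever holds `my_key`.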
that you can use to control the access and
permissions around data on Cloud is VPC
Service Controls. And what this allows you to do
is define a security perimeter around Google Cloud
Platform resources to constrain data
to that perimeter. Basically, control when
data leaves or comes into that security perimeter
and help mitigate data leaving that perimeter. So VPC Service Controls has three main use cases. We’ll talk about mitigating
data exfiltration. So what this diagram
is showing here is that by setting up the
virtual perimeter, you can limit the number
of exfiltration pathways to only what you want. So let’s say that you
have a GCS bucket, and you only want
it to be accessed within services in
a specific network, but not by any services
outside of that network. So what you would do
is you would configure the VPC
Service Controls so that the services
that should access the data are on the same
network, and services that shouldn’t access the
data are outside that. And then configure it such
that it basically auto rejects anything coming in that
isn’t already on that network. This is also a great enabler
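The perimeter idea can be modeled in a few lines of Python. This is a conceptual toy, not the VPC Service Controls API: the `Perimeter` class, its fields, and the project names are assumptions, loosely mirroring the notion of a perimeter that lists protected projects and restricted services.

```python
from dataclasses import dataclass, field


@dataclass
class Perimeter:
    name: str
    projects: set[str] = field(default_factory=set)
    restricted_services: set[str] = field(default_factory=set)

    def allows(self, caller_project: str, target_project: str, service: str) -> bool:
        """Reject calls that cross the perimeter for a restricted service."""
        protected = (
            target_project in self.projects
            and service in self.restricted_services
        )
        if not protected:
            return True  # outside this perimeter's scope; other controls still apply
        return caller_project in self.projects


# Hypothetical perimeter around two projects that handle health data.
phi_perimeter = Perimeter(
    name="phi-perimeter",
    projects={"clinical-data", "analytics"},
    restricted_services={"storage.googleapis.com"},
)
```

In this model a storage call from `analytics` into `clinical-data` is allowed, while the same call from a project outside the perimeter is rejected even if the caller holds valid credentials.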
for hybrid cloud, hybrid GCP, hybrid with other clouds, hybrid
cloud and on-prem, et cetera. So you can include on-prem
resources in your VPC network. And what that will
allow you to do is securely access
resources on Google Cloud from your on-premise
environment, and vice versa. So this is a really
good security control for extending your on-premise
network to cover a Google Cloud, again, and vice versa. And lastly, combined with
Identity-Aware Proxy, VPC Service Controls is also an important
service for enforcing context aware access. So in this case, it’s not only
the identity of the accessing service or the
individual that matters, it’s also the context for which
they’re applying for access. So you can say, it’s like,
where is this user located? Maybe you have data
for some of your users that needs to stay within
a certain network boundary. You don’t want it to leave
their organization’s network. But you know that
it may potentially need to be accessed from one
of your partner organizations. And that access is allowed. So in this case, you can
set up your default network to be restricted to
one organization, block IP addresses outside that. But you can also allow accesses
from IP addresses belonging to your partner organization,
which gives you both the flexibility to
work with your partners and the security
of keeping data in the system. So talking about audit
and activity logging. One of the ways
Google has helped aggregate all of the
audit and activity logs is through Cloud
Security Command Center. And what Cloud Security
Command Center does is it scans through your Google
Cloud Platform resources, detects and responds to
threats in the system, and also aggregates
those for you. So when you think about issues
like misconfigured access policies, misconfigured network
policies, public storage buckets, issues that are
basically fundamental security checks that you want to be
conducting all the time, Cloud Security Command
Center is the service that would be
conducting those checks, and then surfacing that
data to your admins, to the project
owners, to make sure that the events are caught
as early as possible. One of the cool things about
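One of the checks named above, finding public storage buckets, can be sketched as a scan over bucket IAM policies. This is only an illustration of the kind of check involved, not how Cloud Security Command Center is implemented: the function name and sample buckets are hypothetical, while `allUsers` and `allAuthenticatedUsers` are the special members that make a resource publicly accessible.

```python
PUBLIC_MEMBERS = {"allUsers", "allAuthenticatedUsers"}


def find_public_buckets(bucket_policies: dict[str, dict]) -> list[str]:
    """Return the names of buckets with any binding that includes a public member."""
    findings = []
    for bucket, policy in bucket_policies.items():
        for binding in policy.get("bindings", []):
            if PUBLIC_MEMBERS & set(binding.get("members", [])):
                findings.append(bucket)
                break  # one finding per bucket is enough
    return sorted(findings)


# Hypothetical policies for two buckets.
sample_policies = {
    "public-reports": {
        "bindings": [
            {"role": "roles/storage.objectViewer", "members": ["allUsers"]},
        ]
    },
    "patient-scans": {
        "bindings": [
            {
                "role": "roles/storage.objectViewer",
                "members": ["user:radiologist@example.com"],
            },
        ]
    },
}
```

Running a check like this continuously, and surfacing the result to project owners, is the kind of always-on hygiene the talk attributes to the managed service.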
Cloud Security Command Center is it was built for
a hybrid environment. So we know that people
here don’t necessarily just have Google Cloud
Platform resources. People have made investments
into on-prem technologies. People are using multi-cloud. And those don’t disappear the
moment you start using GCP. So we know that. We’ve built a number of
detectors that function well with our native
Cloud Platform tools, but we also integrate
with a number of partners who you may already be using,
who you may consider using. And we also have capabilities
to put your own monitoring data in Cloud Security
Command Center and make it that holistic audit
and logging tool for getting security and monitoring
insights into the organization. So getting towards the end of
the audit and activity logging section, I think
most people here know about Cloud Audit Logging. They know about
Access Transparency. To summarize,
Cloud Audit Logging logs your organization’s
activities on your data. And Access Transparency Logging
logs the cloud provider’s actions on your data. So this could be something
like an approved support activity, a bug ticket
you file for a service. If somebody at Google needs
to go in and help you resolve that ticket, you’ll get an
Access Transparency log that says, this ticket was resolved. We had to access this
project for this reason. And you can basically track
everything in one place. So how do we tie this all together? So we know that Google
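A small Python sketch of the distinction drawn here: telling apart log entries about the customer's own activity from entries about the provider's access. The log IDs below are the ones I believe Cloud Logging uses for Admin Activity, Data Access, and Access Transparency logs, so treat them as assumptions to verify; the function and project name are hypothetical.

```python
# Assumed log IDs (verify against current Cloud Logging documentation).
LOG_ACTORS = {
    "cloudaudit.googleapis.com%2Factivity": "customer",        # admin activity
    "cloudaudit.googleapis.com%2Fdata_access": "customer",     # data reads and writes
    "cloudaudit.googleapis.com%2Faccess_transparency": "provider",
}


def actor_for(log_name: str) -> str:
    """Map a full log name such as 'projects/P/logs/ID' to who acted on the data."""
    log_id = log_name.rsplit("/logs/", 1)[-1]
    return LOG_ACTORS.get(log_id, "unknown")


def split_entries(entries: list[dict]) -> dict[str, list[dict]]:
    """Group log entries by actor so both views can be reviewed side by side."""
    grouped: dict[str, list[dict]] = {"customer": [], "provider": [], "unknown": []}
    for entry in entries:
        grouped[actor_for(entry["logName"])].append(entry)
    return grouped
```

Grouping entries this way gives one place to answer both questions: what did our own people and services do, and when did the provider touch our data and why.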
has some responsibilities. Customers have some
responsibilities. It’s important to showcase
how all of this works. And I’ll do this in two parts. So I’ll show you the
Cloud health care API, which is kind of a managed
service for data aggregation, data storage, data processing
on Google Cloud for health care data. And I’ll also show you
how that service can fit into a larger organization
architecture with an alignment on HIPAA compliance. So overview into the
Cloud health care API. Like I said, the
Cloud health care API implements industry-standard
protocols and formats. In this case, DICOM data
for radiology records for medical imaging
data, FHIR, which is for electronic
health records text, let’s say, and HL7, which
is for clinical messaging. And
what we hope the Cloud health
care API will help enable is accelerated ingestion,
storage, analysis, and integration of
health care data with cloud-based applications. What do I mean by that? So the health care API
provides a secure gateway from an off-cloud system into
advanced capabilities on Google Cloud, like BigQuery for
analysis, TensorFlow for ML, ML Engine for more ML, because
you can never have too much ML. And together, we hope that
this tooling and this product helps you aggregate
your data in cloud and make it available across
modalities, across formats for holistic views of patients
to enable better research and care. How does health care API look? So the health care API is just
very similar to other cloud storage services. It is an API that
sits in a region. So think of London or LA,
or something like that. And within that health care API
region, you have a data set. And that data set consists of
multiple stores, or buckets, of different types
of health care data. So the interesting
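The region, data set, and store hierarchy can be made concrete with the resource names used to address each store. The Python sketch below follows the naming pattern I understand the Cloud Healthcare API to use (project, location, dataset, then a typed store); the helper name and the project, dataset, and store IDs are hypothetical, and whether a given region offers the API is something to check.

```python
# Collection names for the three store types, one per data modality.
STORE_COLLECTIONS = {
    "dicom": "dicomStores",   # medical imaging
    "fhir": "fhirStores",     # electronic health records
    "hl7v2": "hl7V2Stores",   # clinical messaging
}


def store_name(project: str, location: str, dataset: str, kind: str, store_id: str) -> str:
    """Build the full resource name of a store inside a dataset."""
    collection = STORE_COLLECTIONS[kind]
    return (
        f"projects/{project}/locations/{location}"
        f"/datasets/{dataset}/{collection}/{store_id}"
    )


# One dataset in the London region holding all three modalities.
names = [
    store_name("example-project", "europe-west2", "hospital-a", kind, f"{kind}-main")
    for kind in ("dicom", "fhir", "hl7v2")
]
```

All three names share the same dataset prefix, which is what it means for imaging, records, and messages to be aggregated in a single data set.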
thing to note here is a lot of the HIPAA
compliance responsibilities have been taken care of for you. So the storage layer is done. A lot of the networking
layer is done. And the encryption layer that
comes with the storage layer is also done. And the other thing
to notice here is that data is aggregated
across different modalities in a single data set. So we have text data
that’s structured. We have image data
that’s structured, pixel data, and text. And we have clinical messages. And all of those in
one data set helps enable different applications. So let’s imagine
that you’re trying to retrieve medical imaging
data for care or for creating AI or ML models. What you might want to do here
is use the health care API as that connection piece between
an on-premise PACS or a DICOM router. A DICOM router is basically
just a fancy name for something that speaks, reads, writes
medical imaging format DICOM and connects different
systems together. And then, on the other side
of that health care API, you have your analytics and
ML modules, your application ecosystem. Potentially, this is where
you let partners access data. You let patients access data,
and anything in between. When we talk about
HL7, it’s important in the clinical space, not
necessarily much outside that. But basically, it’s the
way that different devices and operations within a hospital
communicate with each other. So if I’m a patient hooked up
to a glucose monitor or a blood pressure cuff, HL7
would be the format by which updates
for my condition get sent to my canonical
patient record. And what we’ve done
is we’ve turned this kind of weird
proprietary HL7 format into structured JSON. What that does is it helps make
it available for analytics, research, you name it. So where’s this all going? Why is this important to the
HIPAA security presentation? What we’re doing
with the health care API is translating data
in multiple formats to a canonical standard
format, and then making that data available to
the rest of Google Cloud, all of the rest of the Google
Cloud HIPAA-compliant or HIPAA-aligned services. And what this
enables is it enables you to reduce the burden of
setting this up yourself. It enables you to reduce
the burden of setting up the security and compliance
for this translation layer, for the storage layer
yourself, and then lets you concentrate on doing
what I would call core business activities. I think in a lot
of cases, security is table stakes for
an organization. Security is something
you have to get right. But security is
not the end goal. Why are we securing data? It is so we can do
something with it. And that something
can be research. That something can
be patient care. That something can be improving
the quality of treatments. But securing the data
for the sake of security is generally not the end goal. And what we’re doing here is
wrapping that security layer into, I guess, a product layer
that can do other things. I think
what some people consider the holy grail of
research is being able to see longitudinal patient
data from multiple sources in multiple formats
in one place. I’m not saying we’ve solved it. I’m just saying
there’s a possibility that it can be done. So you have data in
multiple formats. You want to get it into
the health care API. And then you want to
move it into BigQuery, which is where you’ll get
a SQL-friendly query language across all
of these domains, which used to have different formats. From a security
perspective, maybe you want to deidentify your
data before you share it with other people. The health care API supports
native deidentification of data in these FHIR formats. So what we’ve done is we’ve
taken the structure of FHIR, parsed it, and said,
we know this field contains patient names. So if you want to take out
patient names, we can do that. We know some fields
contain dates. If those dates
need to be removed to protect patient
privacy or shifted to maintain patient
privacy, we can do that. What’s not shown
here is free text. So free text is another case: long strings
of provider notes. We can do that as well, it
just isn’t on this slide. Talking about doing
that in DICOM. DICOM files contain
patient information in metadata, which you see in the
top left and right, and patient data burnt
into the pixel data in the middle-bottom. Our deidentification tools
can also strip that out. And thinking about
HIPAA compliance, this is a fundamental ability
in data processing and data protection. So when you’re sending
data between organizations, one of the things that’s
important to HIPAA is that data is protected in transit. And one of the ways you can
protect patient data in transit is to not send any patient
data in the first place. So deidentifying the
files before they’re sent between organizations,
before they’re published, is a great way to meet
the HIPAA requirements while still meeting the
business requirements. Finally, tying it all together. How can something
like this look? So in the sample that
I have here, the user journey is that
you have some data that exists off of Google Cloud. Perhaps this is on-prem, in a
colo, in your own data center, on third-party infrastructure,
on another cloud, on a managed service provider. There’s quite a number
of options here. And you want to move
that data to Cloud, get a canonical data
storage platform, then make that data available
to research partners, to your own organization,
to third-party services, and anything in between. But at the same
time, maintain HIPAA. That’s kind of the
key methodology here, or maintain GDPR. The controls can change, but
the intent stays the same. So the way I’ve broken
this up here is into four, what we call, GCP projects. And that’s the data ingestion
project, the data storage and analysis project,
the data sharing project, and the monitoring
and auditing project. I’ve included some underlying
management services here, like our logging platform,
our IAM platform, what we call our Stackdriver services
for debugging and monitoring. Not because they’re
in a separate project, but just because they underlie
all of the other ones and I didn’t want to duplicate
them in all of the above. So now we’ve
talked about the use case. Why is this use case relevant? This is what we think of as
one of the fundamental user journeys for cloud migration. You’ll want to take data from
multiple disparate sources, ingest it, normalize it,
aggregate it, and then make it available for use. So breaking this down
into individual steps. How does HIPAA kind of interplay
in all of these things? So this is our first
project, data ingestion. So the data ingestion
layer is where you’ll move raw patient
data onto Google Cloud, perform an ETL operation
(extract, transform, and load), and basically change
that data from its native
on-premise format to a cloud-native format. And here, the protections
that would be in place are temporary storage, ephemeral machines that
are doing the transformation, and a separate network.
We don’t want the network that goes
from on-premise to Cloud to contaminate anything
by having access that reaches too far. Now that we’ve ingested
and normalized the data, we probably want to move it
to a canonical data store. And here is what some people
would call a data lake, a data aggregation layer. It does storage. It does analysis. It might be where you serve
some of the ML applications. It writes logs to
the audit project. And most importantly, no
external network connectivity. So this means that your
network for this application is locked down to only
your organization. It means that the
core data layer is only accessible by services
and service accounts. Those service
accounts may be providing public services, but there is no public
access to this project directly, for obvious reasons. But there does come
a time when you do want to provide
services externally. This, in some cases, can be
done in a different project. This may be done
in a different VPC. Here, we’ve shown it as both. It’s in a different project
with its own network. This project can connect
through service accounts to the data aggregation
layer, and it can also connect to external services. So maybe you want to serve
an ML model publicly. So this service would
expose a public endpoint. Somebody can make an
API request to it. It would then send a service
account request back to the data aggregation project. That service account request
would go grab something from the underlying data set. It would bring it
back to the sharing project, aggregate it, and
share it publicly. So what we’ve left out so far
is a big component of HIPAA, which is centralized logging
and monitoring. If any patient data is touched,
there should be a log entry. If that log entry
is flagged, there needs to be manual
auditing of the activity to determine whether it was
legitimate or illegitimate, and then appropriate
action is taken afterwards. So what this project is doing is
simplifying all that monitoring for you. I won’t say for you, but
simplifying all that monitoring in one place. So all of the other projects
are sending audit logs here. All of the other projects
are sending system logs here, monitoring data. And then in this
project, you would put your rules, your
triggers, your alerts, your notifications, such that
not only is it seeing activity from individual
projects, it’s also seeing activity holistically. So it can account for trends
across different things in the environment. And the benefit of aggregating
all these logs together is specifically that. Now you get access
to full trends, to patterns, to
longitudinal accesses, and how data moves
throughout the system. Whereas if you were just looking
at logs from a single project, you wouldn’t get that. These are probably, I
guess, some of my opinions on best practices for audit logging. It’s what we recommend
for internal teams. It’s what we recommend to some
of our customers and partners as well. Turn on data access logging. So Cloud, we know, has two
forms of audit logging, admin activity, which
is on by default, and data access logging,
which captures access to data. So turn on data access
logging for services that are holding PHI. Set up audit log export. We talked about that,
and why it’s important. And configure access control
for logs appropriately. You want to avoid
the situation where somebody who did malicious
things to your data can then go delete the
audit logs for that. And lastly, you actually want
to look at the audit logs. Also obvious. This process of making
a HIPAA-aligned project architecture is something
that is not impossible, but it is common. And when our team looked
at all of these activities and said, what are
the things that go into making a
HIPAA-aligned project, it was basically things like
controlling access, controlling encryption, controlling network
boundaries, controlling audit logs. So what we did was we created
a set of open-source tooling. We’re unofficially calling it
the Data Protection Toolkit. And it allows for
infrastructure-as-code deployment of projects
that are designed to meet some sort of
organizational regulatory compliance requirements. So it leverages
Deployment Manager, which is a HIPAA-aligned
service, and soon Terraform, which is a
popular open-source toolkit as well. What this actually
does is it aggregates an entire scope of HIPAA
activities into one toolkit. So what you’ll do is you’ll
define what resources you want. You want some compute, some
storage, some networking. And you’ll define the controls
that you want in place. Some resources
shouldn’t be public. Some resources should
prevent data access. Some resources should generate
more logs than others. And some resources should
be allowed to be shared. Basically,
there’s a controls library and a resources library. And you will choose
the specific scope of resources you’re
interested in and the specific scope of
policies you’re interested in. And this service will
automatically deploy them and create a GCP project. At the same time as
doing so, it will also put a continuous
monitoring framework around those resources. And what that continuous
monitoring framework will do is periodically
interrogate the resources you set up against the
policies that you set up and give you data. That data can be, yes,
everything’s good, or that data can be,
here’s a violation. You should probably
look into it. And what that enables you to do is
meet the HIPAA requirements of audit and monitoring all
from one continuous tool chain. Why did we choose Forseti
for the monitoring engine? So for those of you who aren’t
familiar with Forseti Security, it’s native
open-source tooling for Google Cloud that builds
an inventory of projects and scans it
periodically against a set of policies. And for a certain
set of policies, it actually enforces against
malicious changes. So when we think back
to here, if you said, in the
deployment stage, you want three VMs and
two storage buckets, this inventory is
going to get created. And hopefully, it should contain
three VMs and two storage buckets. And if it doesn’t contain three
VMs and two storage buckets, that scanner is going to pick it
up and send you an alert so you can go back and correct it. And all of this, over
time, is kind of saying, if you defined your HIPAA
policies a certain way, as your organization grows,
you may need more resources, but new resources
that you create will be under the
same policy framework that you set up originally. So why we think this tooling
is helpful to customers. It’s a secure, what
we call, quick start. As I was saying, security
is fundamentally important, but it’s not necessarily
the end goal. So this is a framework
to help people get up and running quickly
on GCP by creating predictable, consistent, secure
workloads that then let you do your business requirements. So this is saying,
if you have to create identical development, testing,
and production environments to run a medical
device because that’s the regulatory
framework, this tool will allow you to write one
template, run it three times. The other thing it
allows you to do is if you have a
canonical data layer and you want to give
researchers or partners access to that data, and you want to
give it to them in a locked down environment,
these templates will let you spin up, lock
down identical environments for each research group
that’s accessing the data. And the other thing
is because this is an infrastructure-as-code
framework, it’s easy to share these
templates across teams and across institutions. So now talking about some
customers who are successfully running HIPAA-aligned workloads
on Google Cloud, three examples generally come to
mind that showcase the range of activities. So the University of
Colorado, Health Data Compass, has used Google Cloud to achieve
HIPAA compliance on a data warehouse. And this has helped them reduce
query times from many hours to just a few minutes, and has
helped them cut operating costs and make their research
programs more scalable. And this is, again,
following the model of shared responsibility. So previously, they were
managing all of this on-prem. By moving their infrastructure
to Google Cloud, they cut their
compliance responsibility by a significant portion
and realized other benefits at the same time, leading
to faster research. So a second example. You’ll start to see
a trend develop. Moving workloads to Cloud
helps reduce responsibility, helps reallocate more
bandwidth, money, energy, time to other activities,
leading to the acceleration of the core business
requirements. So specifically for the NIH
National Institute on Aging, they were able to process
200 terabytes of data in just a few weeks,
which would normally have taken them months. This is an institution
that had hardware that was potentially older. They used the same amount of
funds on Cloud-based hardware and processed that
data much faster. And lastly, the Broad Institute
is doing genomic analysis on Google Cloud. And moving their
infrastructure here has helped them accelerate
the analysis of human genomes by 400%. They’ve instituted new and
different security protections than they were
able to do on-prem. And again, the trend continues. Allocating resources better
led to more efficient research patterns. And with that, thank you
for joining us today. And I’ll be around to
take any questions. [APPLAUSE] [MUSIC PLAYING]