Data breaches have become an almost daily occurrence. You have probably heard of the Equifax credit data breach, the latest in a long series of stories about major organizations' data being compromised and exposed: Target's customer credit card database, Anthem's health insurance records, even the federal Office of Personnel Management's background check forms. And that is only the tip of the iceberg; critical servers and databases around the world are compromised every single day, and the frequency of breaches is only likely to increase. You have good reason to be alarmed.

Data security is no longer the exclusive domain of database administrators and network engineers. Today, just about everyone who creates, manages, analyzes, or even simply has access to data is a potential point of failure in an organization's data security plan. If you handle sensitive data (data you would refuse to share freely with a random stranger on the Internet), it is your responsibility to take adequate measures to protect it.

Formal training programs and graduate curricula for data scientists cover data security scantily, if at all, so this is unfamiliar territory for most of us. That is no excuse to neglect it: even a minor lapse can undo all the good work you have otherwise done. So where do you begin? Here are a few data security best practices that every good data scientist should follow:
Take only what you need. The cardinal principle of sharing data strictly on a need-to-know basis is the first rule of data security: data you never held cannot be lost. Collect sensitive data only when a clear need justifies the risk, and even then, gather only the absolute minimum required to complete the task. Data scientists are often tempted to hoard far more data than they need, out of anxiety that they might want it later. That redundant accumulation of sensitive data can turn what would have been a minor security incident into a major disaster, so as a rule, just don't do it.
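As a minimal sketch of this principle in practice (the field names and values here are hypothetical), you can drop sensitive columns at ingestion time so they never enter your working dataset at all:

```python
import csv
import io

# Hypothetical raw export: the source holds far more than the analysis needs.
raw = io.StringIO(
    "user_id,name,email,ssn,purchase_total\n"
    "1,Ana,ana@example.com,123-45-6789,19.99\n"
    "2,Ben,ben@example.com,987-65-4321,42.50\n"
)

# Only what the analysis actually requires; everything else is discarded.
NEEDED = {"user_id", "purchase_total"}

reader = csv.DictReader(raw)
rows = [{k: v for k, v in row.items() if k in NEEDED} for row in reader]
print(rows)
```

Filtering at the point of ingestion, rather than after loading everything, means the sensitive fields never sit in memory, notebooks, or cached intermediate files in the first place.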
Understand the data you have, and don't keep data you no longer need. Apply the same principles from Rule #1 to the data you already hold. Keep a regular inventory of the datasets on hand, periodically assess the sensitivity of each one, permanently erase data you don't need, and take proactive steps to reduce the risk inherent in what remains, for instance by removing or altering unstructured text fields, which can hide potentially sensitive details like phone numbers and names. And when the data describes other people, think not only about your own interests but about theirs too: put yourself in their shoes.
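A simple regex pass can scrub the most common sensitive patterns, such as phone numbers and email addresses, from free-text fields. This is only a sketch: the patterns below are illustrative, and redacting names reliably generally requires a named-entity-recognition tool rather than regular expressions.

```python
import re

# Illustrative patterns; real data will need broader, locale-aware rules.
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def redact(text: str) -> str:
    """Replace phone numbers and email addresses with placeholder tags."""
    text = PHONE.sub("[PHONE]", text)
    return EMAIL.sub("[EMAIL]", text)

note = "Call Jane at 555-867-5309 or mail jane.doe@example.com."
print(redact(note))
```

Running a pass like this over free-text columns before storing or sharing them shrinks the blast radius if the dataset ever leaks.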
Encrypt data whenever possible. Encryption is not a cure-all, but it is a low-cost way to add an extra layer of protection in case a hard drive or network connection is compromised. Outside of a few applications that demand extremely high performance, modern encryption imposes little overhead, so sensitive data should be encrypted by default. The old trade-off between performance and encryption is largely moot: plenty of high-performance applications and services now ship with encryption built in; it is a standard feature in Microsoft's Azure SQL Database, for example.
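To give a flavor of how little code this takes, here is a sketch using the third-party `cryptography` package (assumed installed; the sample data is made up). Fernet is its high-level symmetric-encryption recipe:

```python
from cryptography.fernet import Fernet

# Generate a key; in practice, store it separately from the data it protects
# (e.g., in a secrets manager), never alongside the encrypted files.
key = Fernet.generate_key()
f = Fernet(key)

plaintext = b"name,ssn\nJane Doe,123-45-6789\n"  # hypothetical sensitive record
token = f.encrypt(plaintext)   # safe to write to disk or send over a network
restored = f.decrypt(token)    # round trip back to the original bytes
print(restored == plaintext)
```

The entire benefit disappears if the key is stored next to the ciphertext, so key management deserves at least as much thought as the encryption call itself.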
Use secure sharing services, not email, web servers, or basic FTP servers. These quick, familiar methods of sharing files are fine for everyday personal use, like vacation photos, invitations, or season's greetings, but they are fraught with danger when used to share files containing sensitive data. Instead, use services specifically designed for secure file sharing: an access-controlled S3 bucket on AWS, where you can manage the sharing of encrypted files with other AWS users, or an SFTP server, which implements secure file transfers over an encrypted connection. For novices, even graduating to a service like Dropbox or Google Drive is a step in the right direction. While not as loaded with security features as some other tools, both Dropbox and Google Drive provide better baseline security: they encrypt files at rest, for example, and allow more fine-grained access control than emailing files or parking them on a minimally secured server. As an upgrade from Dropbox or Google Drive, a service like SpiderOak One provides end-to-end encryption for file storage and sharing while maintaining an easy-to-use interface, at an affordable price point: $5/month for 100GB, $12/month for 1TB.
Manu Jeevan is a self-taught data scientist and loves to explain data science concepts in simple terms. You can connect with him on LinkedIn, or email him at manu@bigdataexaminer.com.