Unstructured Data – Episode 1

Video Interview with Chris Dale

Chris Dale of the eDisclosure Information Project interviewed Brendan Sullivan, CEO of S2|DATA, and Fred Moore, President of Horison Information Strategies, Inc., this past summer.

Over the course of three interviews, they cover the growing problem of managing legacy data and the increasing expectation that organizations comprehend what data they have and how they can find what matters using the solutions offered by S2|DATA.

In episode 1 of a 3-part series, Chris discusses with Brendan and Fred the “evolution of tape backup from its beginnings to the model established by S2|DATA.”

Watch Episode 1



Chris Dale: I am Chris Dale of the U.K.-based eDisclosure Information Project. I’ve been speaking and writing about eDiscovery and its related topics for many years. eDiscovery embraces other related topics, from archiving through to privacy and cybersecurity. Central to all these subjects is the problem of identifying data: knowing what you have and where it is. S2|DATA’s business is providing access to the vast amounts of legacy data kept by organisations. Much of that data is on systems whose hardware and software have long fallen out of use. Much of it is on tape. A lot of it is kept because it contains data which may be needed, perhaps for possible future litigation or for regulatory reasons. Much of it still exists just because no one knows that it’s there, or because lawyers have said that it must be kept for some unspecified purpose. Much of it still exists because no one has time to look at it, quite apart from the practical difficulties caused by old formats or redundant systems. This kind of data has always caused problems for eDiscovery and for regulatory purposes. How can you say you’ve found everything relevant when you don’t know what you have? There are other issues as well which make it increasingly important to know what data you’ve got. One is cyber risk. How can you say what you’ve lost if you don’t know what you had? How do you assess the risk? And if you can’t assess the risk, how do you know what resources to apply to it? Another lies in privacy and data protection. There are new laws which increasingly encourage individuals to ask what data you hold on them. The EU’s General Data Protection Regulation, for example, applies to many US organisations, and the US is seeing an increase in privacy regulation such as the California Consumer Privacy Act. S2|DATA was founded to deal with exactly these problems. I’m talking here to Brendan Sullivan, CEO of S2|DATA, and Fred Moore of Horison. Brendan, tell us briefly who you are.


Brendan Sullivan: I’m Brendan Sullivan, CEO and co-founder of S2|DATA. We formed this company in 2013. Me, personally, I’ve been involved in tape since 1985, working in Europe on the development of the first square tape cartridge, the 3480, in production, manufacturing, engineering, sales, and marketing. I moved to the States in ’99, and for the last 20 years I’ve been involved in the services and software of restoration and discovery of data from tape in support of legal, regulatory, and business practices.


Chris Dale: Fred, how about you?


Fred Moore: My name is Fred Moore, and my career started with Storage Technology Corporation, a large data storage company with a major focus on tape technology. I spent 21 years with that company and finished as the corporate vice president of strategic planning and marketing. Twenty years ago, I founded Horison Information Strategies in Boulder, Colorado to get deeper into the storage industry, and today I write the annual reports for the Tape Storage Council and the Active Archive Alliance, and I do a lot of publishing and consulting with tape companies.


Chris Dale: Brendan, the discovery industry, and indeed I, because I’ve known you for a long time, has known you for tape restorations for the past 20 years. What’s the business model for S2|DATA? What’s different about what you’re doing now?


Brendan Sullivan: Boy, our core competencies have probably not changed. We’re essentially an engineering company, with a lot of IT engineers and software engineers, and we understand data right down to the hex level. Over the last 20 years, this has enabled us to write routines and code that emulate what backup software does, so that we can get access to that data and produce it, potentially right through to the courts, in a defensible way. That technology development has lent itself to all sorts of other potential uses. As you mentioned earlier, with the increased focus on privacy regulations such as the GDPR and the CCPA, but also modifications to some of the federal rules around how data is to be produced and managed, there’s a surge in companies’ requirements to tackle their large data mountains at an information governance level: the way they manage their records, the way they retain them, dispose of them, remediate them, produce them, and keep them. So I would say that the technology we’ve learned over the last 20 years, as individual engineers and practitioners within this industry, is now lending itself more towards solving problems at a corporate level within information governance type environments.


Chris Dale: So what do you see as wrong with the offsite storage vaulting model as it stands now?


Brendan Sullivan: I’m not sure I’d say anything is wrong with it. I think there are gaps, and there are reasons it’s evolved. The offsite storage model, as it pertains to backup tapes, became very prevalent through its growth years as a short-term physical location for the movement and transport of daily or weekly backups, rotational media as it might commonly be termed. When you have that kind of requirement, to produce data, move it offsite for security reasons, and then bring it back and write over it, that data is a backup and it already exists within the corporation. Because it already exists, intelligence about the data on that tape is not really required once it’s in an offsite storage facility. This is where data goes to stay for a long time; this is where it should ultimately die. But the problem is that, the way the industry has been brought up over the last 20 years, there’s no intelligence, no business analytics, no data analytics that come from that unstructured data, which, as statisticians tell us (I think Fred mentioned this to me yesterday), makes up 80 percent of the data that’s out there. If there are no business analytics, you can’t make informed decisions about it. Consequently, we feel there’s room for a new type of offsite storage vault: a company, or a capability, that can provide analytics, remediation, deletion, and production without the risks of physically moving tape, without the delays and the time it takes to move those tapes back to a location where data can be read, and without the costs involved in storing what is essentially 80 to 90 percent of companies’ data that is largely either unwanted, useless, or frankly a liability.


Chris Dale: The analytics we’re talking about here, is that much the same as we know from eDiscovery? Has it developed at the same pace, offering the same sort of approach?


Brendan Sullivan: So I think we’re dealing with huge amounts of data, petabytes, even zettabytes, sitting in an unstructured manner on backup and archive tapes, and analytics on that data is expensive and cumbersome. It’s much the same as analytics on any kind of data, in that you can only provide true analytics once you’ve indexed and processed the data on those storage media at a text and metadata level. This is not practical: if you have 100 petabytes of data and you want to provide analytics on 100 petabytes of data, it’s a hugely expensive and time-consuming exercise. The analytics that we provide is at a metadata level and a backup session level. There’s some great information that is relatively easily obtained from backup tapes or from backup databases, namely metadata and backup sessions. By backup session I mean that when you’re creating a backup you might have an Exchange backup, a print server backup, or document management servers. That can be critical information that would allow you to either leave data behind or keep it, produce it or ignore it. It’s relatively quick and easy to get that type of information without delving into the raw content itself. And the same goes for file-level metadata. We can extract file-level metadata from backup tape and then provide analytics on that metadata so that decisions can be made regarding the data itself. So it’s similar in that there’s review and search, but not similar in that we’re not indexing content; we’re indexing metadata and backup-session-level information.
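The session-level triage Brendan describes can be illustrated with a minimal sketch. Everything here is hypothetical: the `BackupSession` fields and the `KEEP_TYPES` policy are illustrative stand-ins for whatever a real backup catalog records, and the point is simply that keep/ignore decisions use session metadata alone, with no content indexing.

```python
from dataclasses import dataclass

@dataclass
class BackupSession:
    """Hypothetical record of one backup session as read from a tape catalog."""
    source_server: str   # e.g. "EXCHANGE01", "PRINTSRV02"
    session_type: str    # classification of the backed-up data
    start_tape: int      # index of the first tape in the set
    size_gb: float

# Illustrative retention policy: session types worth keeping for eDiscovery.
KEEP_TYPES = {"exchange", "document_management", "file_share"}

def triage(sessions):
    """Split sessions into keep/ignore using session-level metadata only,
    without restoring or indexing any raw content."""
    keep = [s for s in sessions if s.session_type in KEEP_TYPES]
    ignore = [s for s in sessions if s.session_type not in KEEP_TYPES]
    return keep, ignore

sessions = [
    BackupSession("EXCHANGE01", "exchange", 1, 120.0),
    BackupSession("PRINTSRV02", "print", 4, 2.5),
    BackupSession("DOCMGMT01", "document_management", 5, 300.0),
]
keep, ignore = triage(sessions)
```

The decision about an entire Exchange or print-server session is made from a few catalog fields, which is why it scales where full content indexing of 100 petabytes does not.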


Chris Dale: Turning to you, Fred, can tape really be an alternative to the cloud as an archive medium?


Fred Moore: That’s one of the most common questions I get today regarding tape and the cloud, and I think it’s important to realize that the major cloud providers in the world have gone heavily into tape as an archive solution. So they’ve endorsed tape in the same way that the traditional data center has. To me, the question is what they can do with the tape in these cloud data centers compared to, say, what S2|DATA can do. With Brendan’s business, he can actually go into the archive and add intelligence to it. I call it “reawakening the archive,” because with metadata and various other capabilities he can make the archive searchable and play right into the eDiscovery and Big Data world that awaits it. So tape is actually a very good archive solution. It’s got the best total cost of ownership of any storage technology, it’s green, and it favours data that’s not used a lot. Archive data is often called WORN data: write once, read seldom if ever. So this data really plays well to tape, whether it’s tape in the data center or tape in the cloud. My question is who can do what with that tape, and that’s where S2|DATA comes into this picture.


Chris Dale: Brendan, you’d presumably endorse that.


Brendan Sullivan: Yeah, I think, as I mentioned earlier, our technology has been learned over the last 20 years, and the technologies you learn over a period of time working in the discovery model make you particularly useful for selective distribution, selective remediation, deletion, or migration from a lot of these large unstructured data pools. What we’ve done is create the ability to migrate data, because there’s a need to remediate, a need to delete, a need to ignore, and a need to keep. Unfortunately, over the last 20 years, as backup systems have evolved, the way data has been pushed to tape, or to any storage medium for that matter, has not always been the most structured or intelligent way, and of course now, when it comes to restoring or unraveling it, you want this but you don’t want that. Therein lies the problem. Companies have never really tackled the purging of data they don’t want and the retention of data they do want. It’s like looking under a stone: once you look under it and see something you don’t like, you put the stone back. Rather than selective deletion, which is very difficult because you can’t append data that’s been written to tape, our approach is migration, selective migration. What we’ve done is create a combination of skill sets. The first is the ability to create an image of a physical volume, a physical volume being a tape. We’ll call it a logical volume, so that a logical volume from a small-capacity tape can go to a large-capacity tape. We can shrink 100,000 tapes down to 5,000 tapes, refresh them, and make them suitable for archiving for the next 50 years. That’s stage 1, but that’s the easy bit, so to speak.
We call that a tape session file, or a tape duplicated file: a physical volume that’s been logically sized. We’ll then use LTFS, the Linear Tape File System, to be able to restore those individual historical source volumes when we need to. The clever stuff is understanding the data that was backed up in the source environment at a backup session level. As I mentioned earlier, backup sessions could be Exchange sessions, print server sessions, document management, file shares, whatever: classifications of data that have been backed up from specific servers. Sometimes those retentions are long, sometimes they’re short; sometimes they create legal liability, sometimes they don’t. They’re typically separated by file marks within those backup session environments, and they might be multi-threaded across tens or even hundreds of backup tape sets. What we do is migrate from the old source media to new media. We don’t land data; we selectively pick sessions and reconstitute those sessions on the destination logical image. The beauty of this system is that we never land data and we don’t have to restore. Restoration of data is expensive: if you’re going to take 100,000 tapes and restore the data, it’s expensive, so projects never get tackled. But if you migrate it and pick what you need as it’s passing through our migration technology, then you get around those costs, and as well as getting around those costs, you’re consolidating the future storage costs and you’re decreasing the requirement for the data center to retain legacy backup software, legacy backup tape libraries and equipment, and the other resources that are required to restore that data. It’s on the latest technology.
A further development of that is that when we’re making that migration, we also extract file-level metadata, things like file extension, file name, path, and created, modified, and accessed dates, and we’ll port that to web portal software we have called “Invenire”. It’s a review-style web portal that manages file-level metadata. So in one multi-stage process we’re migrating, we’re refreshing, we’re compressing and shrinking, we’re remediating or defensibly deleting, and we’re porting and creating data-level intelligence for that unstructured archive at the same time. The outcome, as I mentioned earlier, is a faster time to data. It’s data that you want to keep, data that you have to keep, data that you’re going to use in the future, and data that’s refreshed, and that, on current modern tape technology, will last 50-plus years into the future.
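The file-level metadata fields Brendan lists (extension, name, path, created/modified/accessed dates) are exactly the kind of record a review portal can filter on. Below is a minimal, hypothetical sketch of such records and one decision they enable, flagging files whose last modification predates a retention cutoff; the field names, sample paths, and cutoff are all illustrative assumptions, not Invenire’s actual schema.

```python
import datetime
from dataclasses import dataclass

@dataclass
class FileMeta:
    """Hypothetical file-level metadata record extracted during migration,
    the kind of fields ported to a review-style portal."""
    path: str
    name: str
    extension: str
    created: datetime.date
    modified: datetime.date
    accessed: datetime.date

def past_retention(records, cutoff):
    """Return records whose last modification predates a retention cutoff:
    candidates for defensible deletion during migration."""
    return [r for r in records if r.modified < cutoff]

records = [
    FileMeta("/exchange/mail01.edb", "mail01.edb", ".edb",
             datetime.date(2001, 3, 1), datetime.date(2004, 6, 2),
             datetime.date(2004, 6, 2)),
    FileMeta("/shares/q3_report.doc", "q3_report.doc", ".doc",
             datetime.date(2018, 9, 9), datetime.date(2019, 1, 15),
             datetime.date(2020, 2, 1)),
]
stale = past_retention(records, datetime.date(2010, 1, 1))
```

Because only metadata is examined, this kind of keep/delete decision never requires restoring or reading the underlying file content.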

Episode 2 of this video interview with Chris Dale will be published soon.