Software repairs itself on the go

By Kimberly Patch, Technology Research News

Scientists have been working to make computers more reliable since the first ones were built.

And as information like medical records and business transactions is increasingly expected to be always accessible, fault tolerance and ways to keep data from becoming corrupted have become important subjects of research efforts around the globe.

Researchers from the Massachusetts Institute of Technology have devised a way to repair data on-the-fly, allowing a system to keep on computing after it has sustained an error that would normally require a restart.

The researchers' method uses fewer computer resources than the log-rollback-and-replay methods of data repair in use today. The approach makes this type of fault tolerance less expensive and therefore potentially available for a wider range of applications.

The challenge in repairing data on-the-fly is finding what needs fixing. Key to the researchers' automatic data structure repair method is a relatively simple way to find data errors.

Rather than looking for errors in computer code, the scheme calls for examining abstract models of data that show relationships between data objects. Using abstract representations eliminates much of the information-encoding complexity usually involved, said Martin Rinard, an associate professor of computer science at the Massachusetts Institute of Technology. Data structure corruptions become apparent in the model in the same way health problems like a broken bone or cancerous tissue show up in a graphical representation of data like a CAT scan or MRI, he said.

The researchers' method includes a set of rules that translates data structures into an abstract model and another set of rules that uses the model to express key consistency properties, said Rinard.
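The two rule sets can be illustrated with a toy example. The sketch below is not the researchers' specification language or code; it assumes a hypothetical disk-like structure (a `Disk` with a usage bitmap and per-file block lists) and invented names such as `build_model` and `check`. One function plays the role of the rules that translate the data structure into an abstract model of sets and relations, and the other states consistency properties over that model.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    """Hypothetical concrete state: a usage bitmap plus per-file block lists."""
    bitmap: list  # bitmap[i] is True if block i is marked in use
    files: dict   # file name -> list of block indices


def build_model(disk):
    """Model-building rules: translate the concrete state into sets and a relation."""
    return {
        "Block": set(range(len(disk.bitmap))),
        "Marked": {i for i, used in enumerate(disk.bitmap) if used},
        "owns": {(name, b) for name, blocks in disk.files.items() for b in blocks},
    }


def check(model):
    """Consistency rules, stated over the model rather than the raw encoding."""
    violations = []
    owned = [b for _, b in model["owns"]]
    for name, b in sorted(model["owns"]):
        if b not in model["Block"]:
            violations.append(f"{name} refers to nonexistent block {b}")
    for b in sorted(set(owned)):
        if owned.count(b) > 1:
            violations.append(f"block {b} is owned by more than one file")
    for b in sorted(set(owned) & model["Block"]):
        if b not in model["Marked"]:
            violations.append(f"block {b} is owned but not marked in use")
    for b in sorted(model["Marked"] - set(owned)):
        violations.append(f"block {b} is marked in use but has no owner")
    return violations
```

Calling `check(build_model(disk))` returns an empty list for a consistent structure and a list of human-readable violations otherwise. The point of the design is that the checks never touch the bitmap encoding directly, only the sets and the `owns` relation derived from it.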

When the system finds an inconsistency, it carries out a sequence of actions to fix the problem. The system can put pieces back together, initialize and insert a missing element, or remove a corrupted part, said Rinard. The system can also shift the values of variables to make them self-consistent, he said.

The system can also carry out combinations of repairs, said Rinard. "The algorithm may remove some corrupted parts, then fill the resulting hole with newly allocated and initialized data structures," he said.
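A combination of repairs might look like the following minimal sketch. It is an illustration under stated assumptions, not the researchers' algorithm: the linked chain stored in a dictionary, the `header` record and the `repair` function are all hypothetical. The pass cuts corrupted links, removes parts that are no longer reachable, and then adjusts a stored value so that it agrees with the repaired structure.

```python
def repair(header, nodes):
    """Fix each inconsistency found in a linked chain and return a log of actions.

    header: {"head": index or None, "length": int}      (hypothetical layout)
    nodes:  {index: {"value": ..., "next": index or None}}
    """
    actions = []

    # Repair 1: a head pointer that refers to a missing node is reset.
    if header["head"] is not None and header["head"] not in nodes:
        header["head"] = None
        actions.append("reset dangling head")

    seen = []
    current = header["head"]
    while current is not None:
        seen.append(current)
        nxt = nodes[current]["next"]
        # Repair 2: cut a link that points to a missing node or loops back.
        if nxt is not None and (nxt not in nodes or nxt in seen):
            nodes[current]["next"] = None
            actions.append(f"cut bad link after node {current}")
            nxt = None
        current = nxt

    # Repair 3: remove nodes that can no longer be reached from the head.
    for index in sorted(set(nodes) - set(seen)):
        del nodes[index]
        actions.append(f"removed unreachable node {index}")

    # Repair 4: adjust the stored length so it agrees with the chain.
    if header["length"] != len(seen):
        header["length"] = len(seen)
        actions.append(f"set length to {len(seen)}")
    return actions
```

Running `repair` a second time on the same structure returns an empty list, since the first pass leaves the chain consistent. Note that Repair 3 discards data, which is one way a repaired structure can end up consistent but incomplete.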

The method's big advantage is that it can keep the system going even though there are errors that would normally bring it to a halt, said Rinard.

Computer programs generally assume that data has a certain amount of consistency. If data structures get into a state in which that consistency does not hold, the software may not be able to operate at all, bringing the system to a halt. "In many cases even a small inconsistency in one part of the data structures can completely prevent the software from correctly executing any of the data, even data that is not corrupted," said Rinard.

The researchers' approach is different from the rollback-and-replay methods in practical use today, said Rinard. These involve logging all transactions, and, when a failure occurs, rolling back to a consistent state and replaying the transactions to bring the data up to date.
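For comparison, the rollback-and-replay idea can be sketched in a few lines. This is a simplified illustration, not any particular database's recovery code; the `LoggedStore` class and its methods are hypothetical, and the only operation supported is adding a number to a named counter.

```python
import copy


class LoggedStore:
    """Hypothetical key-value store that recovers by rollback and replay."""

    def __init__(self):
        self.state = {}        # live data, which may become corrupted
        self.checkpoint = {}   # last known consistent copy of the data
        self.log = []          # operations applied since the checkpoint

    def apply(self, key, delta):
        """Log the operation first, then apply it to the live state."""
        self.log.append((key, delta))
        self.state[key] = self.state.get(key, 0) + delta

    def take_checkpoint(self):
        """Save a consistent copy and start a fresh log."""
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()

    def recover(self):
        """Roll back to the checkpoint, then replay the logged operations."""
        self.state = copy.deepcopy(self.checkpoint)
        for key, delta in self.log:
            self.state[key] = self.state.get(key, 0) + delta
```

Even in this toy version, every operation is recorded and the whole state is copied at each checkpoint, which gives a sense of where the resource cost of the approach comes from.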

Instead of rolling back when a failure or error occurs, the researchers' method makes the inconsistent state consistent, said Rinard. This allows a system to keep running despite errors that would ordinarily bring it to a halt until a human operator could intervene. "In particular, if a software error always makes a particular operation fail, the database can never get past that operation," he said.

The method also decreases the software overhead of failure recovery. The amount of code devoted to traditional failure recovery schemes is fairly high -- a large percentage of the code that runs a database is devoted to recovery, according to Rinard.

Data structure repair has traditionally been applied manually -- and at a very high development cost -- to systems with extreme reliability requirements, said Rinard. "One of the goals of our research is to automate the technique to the extent that it becomes feasible to apply routinely to all kinds of software," he said.

This is especially apt today, as computer systems rapidly proliferate throughout the environment, including systems embedded in cars, buildings and public places, said Rinard. "Many such systems must execute without any human intervention," he said. "Our research may turn out to [enable] the systems to detect and repair damage to themselves so that they can continue to execute successfully without human intervention," he said.

The potential downside of such automation, however, is that an error might destroy information or otherwise make the repaired data structures inadequate, said Rinard. In some cases it might be better to let the system fail, then wait for human intervention to bring it back up, he said.

The researchers' simulations show that in many situations it could prove practical to fix errors automatically rather than use the more cumbersome rollback method.

The data structure abstraction method "minimizes the amount of effort required to apply data structure repair... and makes it much easier to understand what the repair algorithm will produce," said Rinard. The simulations also show, surprisingly, that most errors involve relatively simple repair sequences, he said.

The researchers are currently looking at how the method works in real-world applications and are working on improving its efficiency, said Rinard.

Longer-term, the abstract approach could be used to make all kinds of complex data manipulations much easier, said Rinard.

The method could be ready for practical use in two to four years, said Rinard. The method can be applied in small steps to currently deployed systems written in standard programming languages, he said. "It does not require changes to existing development techniques, and can be applied initially by a small group working within the context of a much larger development effort," he said.

Rinard's research colleague was Brian Demsky. They presented the work at the Association for Computing Machinery (ACM) Workshop on Algorithms and Architectures for Self-Managing Systems in San Diego on June 11, 2003. The research was funded by the National Science Foundation (NSF), the Defense Advanced Research Projects Agency (DARPA) and the Singapore-MIT Alliance (SMA).

Timeline:   2-4 years
Funding:   Government, University
TRN Categories:   Operating Systems
Story Type:   News
Related Elements:  Technical paper, "Automatic Data Structure Repair for Self-healing Systems," presented at the Association for Computing Machinery (ACM) Workshop on Algorithms and Architectures for Self-Managing Systems in San Diego, June 11, 2003, and posted at tesla.hpl.hp.com/self-manage03/Finals/demsky.ps.




January 14/21, 2004



© Copyright Technology Research News, LLC 2000-2006. All rights reserved.