
Google just published two patent applications.
US2008/0107337 is “Methods and Systems for Analyzing Data in Media Material Having Layout” and
US2008/0107338 is “Media Material Analysis of Continuing Article Portions”. You can view them at
the USPTO.
Both applications, for which Google is the assignee, pertain to figuring out what’s important and what’s not on Web pages. Companies that scan hard copy and convert those images to machine-readable ASCII use some tricks, but mostly a great deal of brute force, to figure out what’s information and what’s advertising or other dross.
Thanks to Stephen Arnold for the information.
Labels: Analyze, Research