Feed aggregator

Open Library Data Additions: Amazon Crawl: part ch

planet code4lib - Mon, 2016-04-04 00:42

Part ch of Amazon crawl.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

LibUX: Activity Impact Score

planet code4lib - Sun, 2016-04-03 16:29

We get that page speed matters, but in practice optimizing a website for performance is hardly so straightforward as “let’s make this shit blazing fast.” There’s more to it. We know, for instance, that the order in which elements load may matter more than the total page load time, but even this can be pretty hard to get right. These efforts accrue real technical debt, which means they cost real money. For folks whose budgets, talent, and time are constrained, we need to be able to determine where cranking that speedometer has the most bang for its buck, where speed matters most, and where it doesn’t (gasp).

The Activity Impact Score introduced by Tammy Everts for Soasta measures what impact page speed has on the length of time people spend on your site. It compares a performance metric [1] like load time in milliseconds with session length, because session length can be a useful indicator that people are consuming content: longer sessions mean likelier discovery of new events, new and old services, cool repos, archives, all the myriad things, let’s say, that libraries do and their patrons forget.

Pages are grouped into content types [2] (lists, events, searches, landing pages), and the proportion of overall requests associated with each group is combined with the Spearman rank correlation between that group’s load times and users’ session lengths to calculate an activity impact score [3] on a scale from -1 (low impact) to 1 (high impact).
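To make the calculation concrete, here is a minimal sketch of how a score like this could be computed for one page group. It is not Soasta's actual formula, which the post doesn't spell out. Two things are my assumptions: the correlation is multiplied by the group's share of requests, and the sign is flipped so that "slower pages, shorter sessions" comes out positive. The function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation, so no measurable relationship
    return cov / (var_x * var_y) ** 0.5


def activity_impact(load_ms, session_s, group_requests, total_requests):
    """Assumed combination: -(Spearman rho) * share of all requests."""
    rho = spearman(load_ms, session_s)
    return -rho * (group_requests / total_requests)
```

For a group that gets half of all requests and whose sessions shrink steadily as load time grows, `activity_impact([100, 200, 300], [90, 60, 30], 50, 100)` gives 0.5. If SciPy is available, `scipy.stats.spearmanr` could stand in for the hand-rolled `spearman`.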

The bar chart represents relative activity impact scores and the line represents load time in milliseconds.

Higher scores (the homepage, search, and subject guides) demonstrate greater correlation between page speed and session length. So we can use the example above to determine that our “about” page group (informational pages where I threw in parking, policies, and the like) has a relatively low activity impact score despite fast load times, so these kinds of pages don’t benefit all that much from really cranking it up.

And although we might hear the carrion call of those databases, baking pitifully in the lag desert, the score of our homepage proves its speed has way more impact on the length of time people are hanging around. Our time, then, is better off doting there, leaving poorer scorers to choke in the dust a little bit longer.

  1. I wrote a thing for Weave: Journal of Library User Experience about Meaningfully Judging Performance in Terms of User Experience.
  2. With some headscratching I managed to group pages with Google Analytics, but I couldn’t say whether a tool like mPulse (by the folks who brought you the Activity Impact Score) wouldn’t be easier.
  3. The activity impact score uses a method similar to the conversion impact score, which Tammy explains better.

The post Activity Impact Score appeared first on LibUX.

Patrick Hochstenbach: Brush Inking Exercise

planet code4lib - Sun, 2016-04-03 11:00
Filed under: portaits, Sketchbook Tagged: brush, girl, ink, pen, portrait, ski, sktchy

Open Library Data Additions: Amazon Crawl: part ib

planet code4lib - Sun, 2016-04-03 07:16

Part ib of Amazon crawl.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

Patrick Hochstenbach: Brush Inking Exercise

planet code4lib - Sat, 2016-04-02 07:13
Filed under: portaits, Sketchbook Tagged: art, brush, illustration, ink, portrait, sktchy

Ed Summers: Follow

planet code4lib - Sat, 2016-04-02 04:00

I was just testing a new release of twarc and was alerted to a test failure from Travis that seems to point to a change in Twitter’s API. I thought I would quickly write it up because it seemed like an interesting bit of Twitter API arcana, as well as an unexpected use of black box testing to detect changes in social media platforms like Twitter. It also highlights (for me) the benefit of sometimes being lazy and not using mock objects in automated tests.

The specific test that is failing is test_follow which uses Twitter’s follow streaming API request to watch for tweets from a handful of major news organizations like @guardian, @nytimes, @cnnbreak, etc. Once the test gets a tweet it examines the JSON and simply verifies that it came from one of the followed accounts. I added the test to make sure twarc was doing things correctly more than to test Twitter’s API.

The weird thing is that the test has recently started to fail because sometimes it gets back tweets that don’t appear to be sent from any of the followed accounts. The tweets don’t even appear to be retweets or quotes of those accounts either. As you can see from the JSON included at the bottom of the test failure the tweet that is returned is a retweet from @Margaritin22 (971990761) of a tweet by @nytimesbusiness (1754641) but the test wasn’t following either of those users…but it was following @nytimes (807095).
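Here is a rough sketch of the kind of check involved, using the account ids from the failing run. This is not twarc's actual test code; the helper names are made up, and the tweet is pared down to just the v1.1 JSON fields the check reads (`user`, `retweeted_status`, `quoted_status`, `in_reply_to_user_id_str`).

```python
def accounts_involved(tweet):
    """Collect the user ids attached to a v1.1-style tweet dict."""
    ids = {tweet["user"]["id_str"]}
    # A retweet or quote embeds the original tweet, including its author.
    for key in ("retweeted_status", "quoted_status"):
        inner = tweet.get(key)
        if inner:
            ids.add(inner["user"]["id_str"])
    reply_to = tweet.get("in_reply_to_user_id_str")
    if reply_to:
        ids.add(reply_to)
    return ids


def from_followed(tweet, followed_ids):
    """True if any account tied to the tweet is one we asked to follow."""
    return bool(accounts_involved(tweet) & set(followed_ids))


# The surprising tweet: @Margaritin22 retweeting @nytimesbusiness,
# received while following @nytimes (807095).
tweet = {
    "user": {"id_str": "971990761"},
    "retweeted_status": {"user": {"id_str": "1754641"}},
}
print(from_followed(tweet, {"807095"}))  # False
```

Even with retweets, quotes, and replies counted, nothing in the tweet points back at @nytimes, which is why the failure looks like a change on Twitter's side rather than a bug in the check.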

So it looks as if Twitter’s follow API now has some (new?) logic that provides @nytimesbusiness tweets because I am following tweets from @nytimes. Perhaps this is some new marketing feature that allows media outlets like the New York Times to pay Twitter to bundle accounts together for promotional purposes? Or perhaps there is an algorithm at play that assumes that since I am following @nytimes I would be interested in @nytimesbusiness?

The test has only been in use since December 2015, which really isn’t too long ago. Plus the tests normally only run when I’m actively developing twarc or when I push to GitHub. So perhaps this behavior isn’t new and it’s just luck that the test hasn’t failed until now?

But this follow behavior feels a bit like some other changes that Twitter has made to the timeline where tweets are suggested based on popularity. If anyone has any insight into this please let me know.

For the moment the test has stopped following @nytimes and it appears to be working again…at least for now. Here’s a musical interlude since you made it this far:

Open Library Data Additions: Amazon Crawl: part eg

planet code4lib - Fri, 2016-04-01 23:46

Part eg of Amazon crawl.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

Alf Eaton, Alf: Distributed Consensus

planet code4lib - Fri, 2016-04-01 21:52

In 1955, Isaac Asimov published a short story titled "Franchise", about a system that decides who should be elected president (in 2008) by picking a single voter to represent the whole population.

If a single voter is regularly selected at random then, over time, a larger, more representative sample of the population will build up. Any inconsistencies in single voters will be rejected by later consensus.

Distributed systems say "after a certain amount of time, enough votes will have been cast to be sure enough of a consensus".

Each voter must be selected at random, but if this selection is performed by a central machine, that machine must be trusted. To avoid this, everyone in the system is given a task that is guaranteed to give each participant an equal chance of completing first - a chance which is increased only by how much work they do.
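One common shape for such a task is a hash puzzle: keep trying numbers until a hash falls below a target. The post doesn't name a specific scheme, so treat the following as an illustrative sketch under that assumption. The function names, the `voter:ballot:nonce` string layout, and the choice of SHA-256 are mine.

```python
import hashlib


def work_hash(voter, ballot, nonce):
    """Hash the voter, ballot and nonce together into one big integer."""
    data = f"{voter}:{ballot}:{nonce}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def find_nonce(voter, ballot, difficulty_bits):
    """Try nonces in order until the hash falls below the target.

    Each attempt succeeds with probability 2**-difficulty_bits, so the
    only way to raise your odds of finishing first is to try more nonces.
    """
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while work_hash(voter, ballot, nonce) >= target:
        nonce += 1
    return nonce


def is_valid(voter, ballot, nonce, difficulty_bits):
    """Anyone can check the work with a single hash, no trusted machine."""
    return work_hash(voter, ballot, nonce) < (1 << (256 - difficulty_bits))
```

With a small difficulty such as `find_nonce("alice", "ballot-1", 8)` the search ends after a few hundred attempts on average. Finding the nonce is expensive, checking it is cheap, and that asymmetry is what removes the need for a central selector.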

As a reward for participating, anyone who casts a vote receives a monetary payment.

The system will, generally, consume energy up to the value of the reward for casting each vote.

To save energy, votes can instead be given to those who have purchased the most shares (stake) in the system (i.e. a single up-front payment, rather than an ongoing subscription).
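A stake-based selection can be sketched as a weighted draw, where the chance of being picked is proportional to the shares held. This is only an illustration of the idea. The `pick_voter` helper and the stake table are hypothetical, and a real system would need a source of randomness that no single party controls.

```python
import random


def pick_voter(stakes, rng):
    """Pick one holder with probability proportional to their stake."""
    holders = sorted(stakes)  # fixed order keeps seeded runs reproducible
    weights = [stakes[h] for h in holders]
    return rng.choices(holders, weights=weights, k=1)[0]


rng = random.Random(42)  # seeded only so the sketch is repeatable
stakes = {"alice": 1, "bob": 3}
draws = [pick_voter(stakes, rng) for _ in range(10_000)]
print(draws.count("bob") / len(draws))  # close to 0.75
```

No hashing race is needed here, which is where the energy saving comes from.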

Another alternative, valid for small populations, is to collect the sample in a single poll: invite all members to participate, and generate the consensus after a certain amount of time has passed. This works when most of the members are known to be good actors, so in order to avoid a small, good population being overwhelmed by new, bad actors, new members need to gradually build up reputation in order to vote.

District Dispatch: E-rate does not suffer fools

planet code4lib - Fri, 2016-04-01 19:21

FCC building in Washington, D.C.

Today, ALA submitted a request to the Federal Communications Commission (FCC) to extend the 2016 E-rate application window.  This year’s application process has proven challenging, in large part due to the new online filing system, the E-rate Productivity Center (EPC). While the online system has a number of benefits to the applicant, the roll-out of this entirely new system has raised a number of issues that have been difficult for library applicants to overcome.

The Schools and Libraries Division of the Universal Service Administrative Company (USAC), which administers the E-rate program on behalf of the FCC, has made extra efforts to address issues libraries have come up against. They have dedicated countless staff hours and have increased regular communications and outreach to the library community. Despite these significant extra efforts, issues remain that in some instances have made progress in the application process come to a halt.

With only four weeks left in the current window, we, along with the E-rate Task Force, felt compelled to call attention to the possibility that some libraries would be severely disadvantaged. Given the tremendous opportunity following the FCC’s Modernization proceeding, including funding for Category 2 services, this is an important year for libraries to participate in the E-rate program. There are real opportunities to upgrade old internal networks and to plan for larger upgrade projects.

Change of the magnitude USAC has had to undertake will no doubt take time to implement. By extending the application window, we hope the issues that confront libraries can be ameliorated and that the extra time will enable libraries to successfully apply.

We don’t know when the FCC would make a decision, but with the clock ticking, we are hopeful it will be timely. In the meantime, applicants have a host of resources at their disposal. Each state has an E-rate coordinator who is well versed in the intricacies of the E-rate program and works with local libraries in their respective states. The E-rate Task Force recently created an E-rate clearinghouse for libraries, Libraryerate, which is a peer resource sharing portal. Finally the SLD of USAC is the ultimate resource for program rules, forms, and guidance. Their website includes a “File Along with Me” blog with tips for successfully navigating the EPC portal and if you haven’t yet, you can sign up for the weekly News Brief for the latest information.

The post E-rate does not suffer fools appeared first on District Dispatch.

LITA: Look who’s talking: Conducting a needs assessment project to inform your service design

planet code4lib - Fri, 2016-04-01 18:11

If you can’t tell, I’m on a research data services kick of late, mostly because we’re in the throes of trying to define our service model and move some of our initiatives forward all while building new partnerships.

What I didn’t mention in my previous post is all the lead-up work we’re doing to lay the groundwork for those awesome services I discussed. And there is quite a bit to do in that regard, so I thought it would be helpful to provide some tips on what you can do to set the stage for a successful launch of these types of services. Here goes!

If you have a specific population/audience in mind for your services, getting feedback from them is essential. This can take many forms, although we tend to rely on the tried and true (and often dreaded) survey, which is great if you want to collect a large amount of data that may or may not lead to follow-up questions. But what if you want to do something a little different?

  • Getting started

If you want to publish or share your results, get Institutional Review Board (IRB) clearance first. This is a pain, and it often falls in the Exempt category, but because there are people involved in your research project, it’s best to get the green light from the IRB so you don’t have to worry about it later. This will entail filling out forms describing your project, how you will collect and manage the data, and how you will ensure compliance with human subject research protocol such as confidentiality. Prior to submitting something to IRB, the principal investigator will have to complete CITI training or something similar to verify his/her understanding of the processes involved.

  • Let me count the ways


Decide how you want to conduct your needs assessment. Each methodology has its pros and cons. I mentioned surveys are always popular and they tend to yield high numbers. But the drawback is that you cannot ask for clarification, participants have a limited number of choices (especially if you have a lot of multiple choice questions), and you have to design your questions very carefully so that they are clear, and are asking what you really need to know, otherwise the results could be skewed or meaningless, or both.

Interviews are great if you want to gather qualitative data and don’t mind reaching smaller numbers, but having more in-depth information might be more useful for your purposes. As with surveys, IRB will require that you have a clear script in place and that you ask the same questions every time so you will need to make sure you have this information ahead of time. Sending teams of two to each interview might be helpful so that you have two sets of note-takers who can catch different things and can cover for each other if something unexpected comes up. If you plan to record a conversation, this will need additional clearance from IRB and you will have to make sure you have a clear process in place for starting and stopping the recording and letting participants know they are going to be recorded or taped.

Another option is to conduct focus groups. This can also take various forms, everything from asking questions, to leading participants through a design process as part of a design thinking activity, or simply asking for feedback in reaction to a prototype of some sort. You will have to make sure you recruit a representative group, have a location, a clearly established process, and a way to guide the conversation as it unfolds in addition to capturing what was said.

A final alternative is to conduct ethnographic and participatory research. Instead of simply asking a question, you are letting your audience tell you what they want or expect for a specific service. In other words, they are taking an active part in the design process itself. Nancy Fried Foster is an expert in this area, and I highly recommend looking at her work if you’re interested in this methodology. Having participants draw a picture of their “ideal” space or service can lead to some fun conversations!

  • Who’s on first

Who will conduct the assessment? This may be everyone in a specific unit or department, a handful of people, or even just one person. Your methodology will influence the number of data collectors needed. You will also want to think about any training the group will undertake as part of these activities. Especially if you’re collecting data in a more qualitative format, you will want to ensure that everyone is doing this in as uniform a fashion as possible and you may need several training sessions to prepare.

  • The right stuff

Have all your materials ready ahead of time, especially if they involve asking specific questions, or having participants walk through a set of prescribed activities. Make sure you have instructions clearly spelled out and provide handouts for anything that requires a deeper explanation.

  • Getting Organized

Schedules are tough to organize even for internal meetings, let alone with others on campus, so having a form where participants can designate their preferred time or fill in one of their own is much easier than playing email ping-pong in nailing down a date and time.

  • Marketing is key

We found out the hard way that one approach is not always ideal. We sent out a mass email to faculty only to receive two responses. When liaisons sent out the same exact message, we saw an immediate increase in numbers. Make sure you explain the purpose of the research and make it as easy as possible to indicate willingness to participate. The source of the message counts as well: an email from a generic library account may not garner much attention, but a forwarded message from a department head might do the trick.

  • Data analysis and dissemination

I won’t get too much into the weeds of how to analyze the data you collect, except to say that you will need to set aside ample time for this activity. Once the results are compiled and you have your action items identified, make sure you share the results back with the participants so that you can show them the product of their involvement no matter how small. This will go a long way towards ensuring that they will actually use the bright, shiny new services you create based on their input.

  • Follow-through

Whatever you do, make sure you do something! There’s nothing worse than collecting valuable (hopefully) information only to have it sitting dormant for months on end because this wasn’t high on someone’s priority list. Make sure you have the commitment and resources you need before you begin the project so that you can implement the ideas that emerge as a result in a timely manner.

Open Library Data Additions: Amazon Crawl: part fa

planet code4lib - Fri, 2016-04-01 15:30

Part fa of Amazon crawl.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

OCLC Dev Network: SPARQL Tips Tricks and Tools

planet code4lib - Fri, 2016-04-01 14:00

Learn some useful tips and tricks to use SPARQL effectively.

Information Technology and Libraries: Transitioning from XML to RDF: Considerations for an effective move towards Linked Data and the Semantic Web

planet code4lib - Fri, 2016-04-01 11:26
Metadata, particularly within the academic library setting, is often expressed in eXtensible Markup Language (XML) and managed with XML tools, technologies, and workflows. Managing a library’s metadata currently takes on a greater level of complexity as libraries are increasingly adopting the Resource Description Framework (RDF). Semantic Web initiatives are surfacing in the library context with experiments in publishing metadata as Linked Data sets and also with development efforts such as BIBFRAME and the Fedora 4 Digital Repository incorporating RDF. Use cases show that transitions into RDF are occurring in both XML standards and in libraries with metadata encoded in XML. It is vital to understand that transitioning from XML to RDF requires a shift in perspective from replicating structures in XML to defining meaningful relationships in RDF. Establishing coordination and communication among these efforts will help as more libraries move to use RDF, produce Linked Data, and approach the Semantic Web.

Information Technology and Libraries: Fulfill Your Digital Preservation Goals with a Budget Studio

planet code4lib - Fri, 2016-04-01 11:26

In order to fulfill digital preservation goals, many institutions use high-end scanners for in-house scanning of historical print and oversize materials. However, high-end scanners’ prices do not fit in many small institutions’ budgets. As digital single-lens reflex (DSLR) camera technologies advance and camera prices drop quickly, a budget photography studio can help to achieve institutions’ preservation goals. This paper compares images delivered by a high-end overhead scanner and a consumer level DSLR camera, discusses pros and cons of using each method, demonstrates how to set up a cost efficient shooting studio, and presents a budget estimate for a studio.

Information Technology and Libraries: Editorial Board Thoughts: The Importance of Staff Change Management in the Face of the Growing “Cloud”

planet code4lib - Fri, 2016-04-01 11:26

Information Technology and Libraries: Lessons Learned: A Primo Usability Study

planet code4lib - Fri, 2016-04-01 11:26

The University of Houston Libraries implemented Primo as the primary search option on the library website in May 2014. In May 2015, the Libraries released a redesigned interface to improve user experience with the tool. The Libraries took a user-centered approach to redesigning the Primo interface by conducting a "think-aloud" usability test in order to gather user feedback and identify needed improvements. This article describes the methodology and findings from the usability study, the changes that were made to the Primo interface as a result, and implications for discovery system vendor relations and library instruction.

Patrick Hochstenbach: Comics Art in Relationship

planet code4lib - Fri, 2016-04-01 05:10
Homework for the California College of the Arts online course: create a design based on a given existing script. There is ya monster…
Filed under: Comics Tagged: art, comic, election, ink, lineart, pen, politics, trump

Eric Hellman: April Fools is Cancelled This Year

planet code4lib - Fri, 2016-04-01 04:14
Since the Onion dropped their fake news format in January in favor of serious reporting, it's become clear that the web's April Fools Day would be very different this year. Why make stuff up when real life is so hard to believe?

All my ideas for satirical blog posts seemed too sadly realistic. After people thought my April 1 post last year was real, all my ideas for fake posts about false privacy and the All Writs Act seemed cruel. I thought about doing something about power inequity in libraries and publishing, but then all my crazy imaginings came true on the ACRL SCHOLCOMM list.

So no April Fools post on Go To Hellman this year. Except for this one, of course.

Cynthia Ng: Publishers and the Print Disabled in Canada: Some Get It, Some Don’t

planet code4lib - Thu, 2016-03-31 22:22
It’s no secret that the print-disabled are an under-supported group. While those who are not print challenged are able to read all the literature that we understand, print-disabled readers only have access to a small percentage (1-7%) of the world’s published books. There are many efforts underway with: legislation (namely, the Marrakesh Treaty), many existing …

Nick Ruest: 1,203,867 #elxn42 images

planet code4lib - Thu, 2016-03-31 21:23
Background

Last August, I began capturing the #elxn42 hashtag as an experiment, and potential research project with Ian Milligan. Once Justin Trudeau was sworn in as the 23rd Prime Minister of Canada, we stopped collection, and began analysing the dataset. We wrote that analysis up for the Code4Lib Journal, which will be published in the next couple weeks. In the interim, you can check out our pre-print here. Included in that dataset is a line-delimited list of the URLs of every embedded image tweeted in the dataset: 1,203,867 images. So, I downloaded them. It took a couple days.


IMAGES=/path/to/elxn42-image-urls.txt
cd /path/to/elxn42/images
cat $IMAGES | while read line; do
  wget "$line"
done

Now we can start doing image analysis.

1,203,867 images, now what?

I really wanted to take a macroscopic look at all the images, and looking around, the best tool for the job looked like montage, an ImageMagick command for creating composite images. But it wasn't that simple. 1,203,867 images is a lot of images, and starts getting you thinking about what big data is. Is this big data? I don't know. Maybe?

Attempt #1

I can just point montage at a directory and say go to town, right? NOPE.

$ montage /path/to/1203867/elxn42/images/* elxn42.png

Too many arguments! After glancing through the man page, I find that I can pass it a line-delimited text file with the paths to each file.

# file paths: write the full path of every image to images.txt, one per line
$ find "$(pwd)" -type f ! -name images.txt > images.txt

Now that I have that, I can pass montage that file, and I should be golden, right? NOPE.

$ montage @images.txt elxn42.png

I run out of RAM, and get a segmentation fault. This was on a machine with 80GB of RAM.

Attempt #2

Is this big data? What is big data?

Where can I get a machine with a bunch of RAM really quick? Amazon!

I spin up a d2.8xlarge (36 cores and 244GB RAM) EC2 instance, get my dataset over there, ImageMagick installed, and run the command again.

$ montage @images.txt elxn42.png

NOPE. I run out of RAM, and get a segmentation fault. This was on a machine with 244GB of RAM.

Attempt #3

Is this big data? What is big data?

I've failed on two very large machines. Well, what I would consider large machines. So, I start googling, and reading more ImageMagick documentation. Somebody has to have done something like this before, right? Astronomers, they deal with big images right? How do they do this?

Then I find it: ImageMagick Large Image Support/Terapixel support, and the timing couldn't have been better. Ian and I had recently gotten set up with our ComputeCanada resource allocation. I set up a machine with 8 cores, 12GB RAM, and compiled the latest version of ImageMagick from source: ImageMagick-6.9.3-7.

montage -monitor -define registry:temporary-path=/data/tmp -limit memory 8GiB -limit map 10GiB -limit area 0 @elxn42-tweets-images.txt elxn42.png

Instead of running everything in RAM, which became my issue with this job, I'm able to write all the tmp files ImageMagick creates to disk with -define registry:temporary-path=/data/tmp and limit my memory usage with -limit memory 8GiB -limit map 10GiB -limit area 0. Then knowing this job was going to probably take a long time, -monitor comes in super handy for providing feedback of where the job is at process-wise.

In the end, it took just over 12 days to run the job. It took up 3.5TB of disk space at its peak, and in the end generated a 32GB png file. You can check it out here.

$ pngcheck elxn42.png
OK: elxn42.png (138112x135828, 48-bit RGB, non-interlaced, 69.6%).
$ exiftool elxn42.png
ExifTool Version Number : 9.46
File Name : elxn42.png
Directory : .
File Size : 32661 MB
File Modification Date/Time : 2016:03:30 00:48:44-04:00
File Access Date/Time : 2016:03:30 10:20:26-04:00
File Inode Change Date/Time : 2016:03:30 09:14:09-04:00
File Permissions : rw-rw-r--
File Type : PNG
MIME Type : image/png
Image Width : 138112
Image Height : 135828
Bit Depth : 16
Color Type : RGB
Compression : Deflate/Inflate
Filter : Adaptive
Interlace : Noninterlaced
Gamma : 2.2
White Point X : 0.3127
White Point Y : 0.329
Red X : 0.64
Red Y : 0.33
Green X : 0.3
Green Y : 0.6
Blue X : 0.15
Blue Y : 0.06
Background Color : 65535 65535 65535
Image Size : 138112x135828

Concluding Thoughts

Is this big data? I don't know. I started with 1,203,867 images and made it into a single image. Using 3.5TB of tmp files to create a 32GB image is mind boggling when you start to think about it. But then it isn't when you think about it more. Do I need a machine with 3.5TB of RAM to run this in memory? Or do I just need to design a job with the resources I have and be patient. There are always trade-offs. But, at the end of it all, I'm still sitting here asking myself what is big data?
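For a sense of scale, here is a quick back-of-the-envelope calculation from the pngcheck and exiftool numbers above. It is only a sketch. Reading exiftool's "MB" as mebibytes is an assumption on my part, though it is the reading that reproduces the 69.6% pngcheck reports.

```python
width, height = 138112, 135828      # dimensions reported by pngcheck
channels, bytes_per_sample = 3, 2   # RGB at 16 bits per channel (48-bit RGB)

# Size of the raw pixel data before PNG compression.
raw_bytes = width * height * channels * bytes_per_sample

# Assumption: exiftool's "32661 MB" is counted in units of 1024 * 1024 bytes.
file_bytes = 32661 * 1024 * 1024

saving = 1 - file_bytes / raw_bytes
print(f"pixels: {width * height:,}")              # pixels: 18,759,476,736
print(f"uncompressed: {raw_bytes / 1e9:.1f} GB")  # uncompressed: 112.6 GB
print(f"compression saving: {saving:.1%}")        # compression saving: 69.6%
```

So the finished picture alone is nearly 19 billion pixels and over 100GB before compression, and that is before counting whatever ImageMagick keeps around for 1.2 million source images while it works.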

Maybe this is big data :-)

I extracted every image in the 4.1TB GeoCities WARC collection and you won’t believe what I found next!

(me neither… in short: too many!)

— Ian Milligan (@ianmilligan1) March 31, 2016

@ianmilligan1 so, we're going to montage these, right!?

— nick ruest (@ruebot) March 31, 2016

tags: elxn42, ImageMagick, twitter, big data

